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Abstract 

The  Tall-Skinny  QR  (TSQR)  algorithm  is  more  communication  efficient  than  the  standard  House¬ 
holder  algorithm  for  QR  decomposition  of  matrices  with  many  more  rows  than  columns.  However, 
TSQR  produces  a  different  representation  of  the  orthogonal  factor  and  therefore  requires  more  software 
development  to  support  the  new  representation.  Further,  implicitly  applying  the  orthogonal  factor  to 
the  trailing  matrix  in  the  context  of  factoring  a  square  matrix  is  more  complicated  and  costly  than  with 
the  Householder  representation. 

We  show  how  to  perform  TSQR  and  then  reconstruct  the  Householder  vector  representation  with 
the  same  asymptotic  communication  efficiency  and  little  extra  computational  cost.  We  demonstrate 
the  high  performance  and  numerical  stability  of  this  algorithm  both  theoretically  and  empirically.  The 
new  Householder  reconstruction  algorithm  allows  us  to  design  more  efficient  parallel  QR  algorithms, 
with  significantly  lower  latency  cost  compared  to  Householder  QR  and  lower  bandwidth  and  latency 
costs  compared  with  Communication- Avoiding  QR  (CAQR)  algorithm.  As  a  result,  our  final  parallel 
QR  algorithm  outperforms  ScaLAPACK  and  Elemental  implementations  of  Householder  QR  and  our 
implementation  of  CAQR  on  the  Hopper  Cray  XE6  NERSC  system. 


1  Introduction 

Because  of  the  rising  costs  of  communication  (i.e.,  data  movement)  relative  to  computation,  so-called 
communication-avoiding  algorithms — ones  that  perform  roughly  the  same  computation  as  alternatives 
but  significantly  reduce  communication — often  run  with  reduced  running  times  on  today’s  architectures. 
In  particular,  the  standard  algorithm  for  computing  the  QR  decomposition  of  a  tall  and  skinny  matrix 
(one  with  many  more  rows  than  columns)  is  often  bottlenecked  by  communication  costs.  A  more  recent 
algorithm  known  as  Tall-Skinny  QR  (TSQR)  is  presented  in  [1]  (the  ideas  go  back  to  [2])  and  overcomes  this 
bottleneck  by  reformulating  the  computation.  In  fact,  the  algorithm  is  communication  optimal,  attaining 
the  lower  bound  for  communication  costs  of  QR  decomposition  (up  to  a  logarithmic  factor  in  the  number 
of  processors)  [3].  Not  only  is  communication  reduced  in  theory,  but  the  algorithm  has  been  demonstrated 
to  perform  better  on  a  variety  of  architectures,  including  multicore  processors  [4],  graphics  processing  units 
[5],  and  distributed- memory  systems  [6]. 

The  standard  algorithm  for  QR  decomposition,  which  is  implemented  in  LAPACK  [7],  ScaLAPACK  [8], 
and  Elemental  [9]  is  known  as  Householder- QR  (given  below  as  Algorithm  1).  For  tall  and  skinny  matrices, 
the  algorithm  works  column-by-column,  computing  a  Householder  vector  and  applying  the  corresponding 
transformation  for  each  column  in  the  matrix.  When  the  matrix  is  distributed  across  a  parallel  machine, 
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this  requires  one  parallel  reduction  per  column.  The  TSQR  algorithm  (given  below  as  Algorithm  2),  on 
the  other  hand,  performs  only  one  reduction  during  the  entire  computation.  Therefore,  TSQR  requires 
asymptotically  less  inter-processor  synchronization  than  Householder-QR  on  parallel  machines  (TSQR  also 
achieves  asymptotically  higher  cache  reuse  on  sequential  machines). 

Computing  the  QR  decomposition  of  a  tall  and  skinny  matrix  is  an  important  kernel  in  many  contexts, 
including  standalone  least  squares  problems,  eigenvalue  and  singular  value  computations,  and  Krylov 
subspace  and  other  iterative  methods.  In  addition,  the  tall  and  skinny  factorization  is  a  standard  building 
block  in  the  computation  of  the  QR  decomposition  of  general  (not  necessarily  tall  and  skinny)  matrices. 
In  particular,  most  algorithms  work  by  factoring  a  tall  and  skinny  panel  of  the  matrix,  applying  the 
orthogonal  factor  to  the  trailing  matrix,  and  then  continuing  on  to  the  next  panel.  Although  Householder- 
QR  is  bottlenecked  by  communication  in  the  panel  factorization,  it  can  apply  the  orthogonal  factor  as  an 
aggregated  Householder  transformation  efficiently,  using  matrix  multiplication  [10]. 

The  Communication-Avoiding  QR  (CAQR)  [1]  algorithm  uses  TSQR  to  factor  each  panel  of  a  general 
matrix.  One  difficulty  faced  by  CAQR  is  that  TSQR  computes  an  orthogonal  factor  that  is  implicitly 
represented  in  a  different  format  than  that  of  Householder-QR.  While  Householder-QR  represents  the 
orthogonal  factor  as  a  set  of  Householder  vectors  (one  per  column),  TSQR  computes  a  tree  of  smaller  sets  of 
Householder  vectors  (though  with  the  same  total  number  of  nonzero  parameters).  In  CAQR,  this  difference 
in  representation  implies  that  the  trailing  matrix  update  is  done  using  the  implicit  tree  representation  rather 
than  matrix  multiplication  as  possible  with  Householder-QR.  From  a  software  engineering  perspective, 
this  means  writing  and  tuning  more  complicated  code.  Furthermore,  from  a  performance  perspective, 
the  update  of  the  trailing  matrix  within  CAQR  is  less  communication  efficient  than  the  update  within 
Householder-QR  by  a  logarithmic  factor  in  the  number  of  processors. 

Building  on  a  method  introduced  by  Yamamoto  [11],  we  show  that  the  standard  Householder  vector 
representation  may  be  recovered  from  the  implicit  TSQR  representation  for  roughly  the  same  cost  as  the 
TSQR  itself.  The  key  idea  is  that  the  Householder  vectors  that  represent  an  orthonormal  matrix  can  be 
computed  via  LU  decomposition  (without  pivoting)  of  the  orthonormal  matrix  subtracted  from  a  diagonal 
sign  matrix.  We  prove  that  this  reconstruction  is  as  numerically  stable  as  Householder-QR  (independent 
of  the  matrix  condition  number),  and  validate  this  proof  with  experimental  results. 

This  reconstruction  method  allows  us  to  get  the  best  of  TSQR  algorithm  (avoiding  synchronization) 
as  well  as  the  best  of  the  Householder-QR  algorithm  (efficient  trailing  matrix  updates  via  matrix  multi¬ 
plication).  By  obtaining  Householder  vectors  from  the  TSQR  representation,  we  can  logically  decouple 
the  block  size  of  the  trailing  matrix  updates  from  the  number  of  columns  in  each  TSQR.  This  abstraction 
makes  it  possible  to  optimize  panel  factorization  and  the  trailing  matrix  updates  independently.  Our  result¬ 
ing  parallel  implementation  outperforms  ScaLAPACK,  Elemental,  and  our  own  lightly-optimized  CAQR 
implementation  on  the  Hopper  Cray  XE6  platform  at  NERSC.  While  we  do  not  experimentally  study 
sequential  performance,  we  expect  our  algorithm  will  also  be  beneficial  in  this  setting,  due  to  the  cache 
efficiency  gained  by  using  TSQR. 

2  Performance  cost  model 

In  this  section,  we  detail  our  algorithmic  performance  cost  model  for  parallel  execution  on  p  processors.  We 
will  use  the  0.-/3- 7  model  which  expresses  algorithmic  costs  in  terms  of  computation  and  communication 
costs,  the  latter  being  composed  of  a  bandwidth  cost  as  well  as  a  latency  cost.  We  assume  the  cost  of  a 
message  of  size  w  words  is  a+/3w,  where  a  is  the  per- message  latency  cost  and  /3  is  the  per- word  bandwidth 
cost.  We  let  7  be  the  cost  per  floating  point  operation.  We  ignore  the  network  topology  and  measure  the 
costs  in  parallel,  so  that  the  cost  of  two  disjoint  pairs  of  processors  communicating  the  same-sized  message 
simultaneously  is  the  same  as  that  of  one  message.  We  also  assume  that  two  processors  can  exchange 
equal-sized  messages  simultaneously,  but  a  processor  can  communicate  with  only  one  other  processor  at  a 
time. 

Our  algorithmic  analysis  will  depend  on  the  costs  of  collective  communication,  particularly  broadcasts, 
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reductions,  and  all-reductions,  and  we  consider  both  tree-based  and  bidirectional-exchange  (or  recursive 
doubling/halving)  algorithms  [12,  13,  14].  For  arrays  of  size  w  >  p,  the  collectives  scatter,  gather,  all¬ 
gather,  and  reduce-scatter  can  all  be  performed  with  communication  cost 


a  ■  log  p  +  (3  ■ 


p  —  1 


w 


P 


(1) 


(reduce-scatter  also  incurs  a  computational  cost  of  n/-((p—l)/p)w).  Since  a  broadcast  can  be  performed  with 
scatter  and  all-gather,  a  reduction  can  be  performed  with  reduce-scatter  and  gather,  and  an  all-reduction 
can  be  performed  with  reduce-scatter  and  all-gather,  the  communication  costs  of  those  collectives  for  large 
arrays  are  twice  that  of  Equation  (1).  For  small  arrays  (w  <  p),  it  is  not  possible  to  use  pipelined  or 
recursive-doubling  algorithms,  which  require  the  subdivision  of  the  message  into  many  pieces,  therefore 
collectives  such  as  a  binomial  tree  broadcast  must  be  used.  In  this  case,  the  communication  cost  of 
broadcast,  reduction,  and  all-reduction  is  a  ■  logp  +  /3  ■  w  logp. 


3  Previous  Work 

We  distinguish  between  two  types  of  QR  factorization  algorithms.  We  call  an  algorithm  that  distributes 
entire  rows  of  the  matrix  to  processors  a  ID  algorithm.  Such  algorithms  are  often  used  for  tall  and  skinny 
matrices.  Algorithms  which  distribute  the  matrix  across  a  2D  grid  of  pr  x  pc  processors  are  known  as  2D 
algorithms.  Many  right-looking  2D  algorithms  for  QR  decomposition  of  nearly  square  matrices  divide  the 
matrix  into  column  panels  and  work  panel-by-panel,  factoring  the  panel  with  a  ID  algorithm  and  then 
updating  the  trailing  matrix.  We  consider  two  such  existing  algorithms  in  this  section:  2D-Householder-QR 
(using  Householder-QR)  and  CAQR  (using  TSQR). 

3.1  Householder-QR 

We  first  present  Householder-QR  in  Algorithm  1,  following  [15]  so  that  each  Householder  vector  has  a 
unit  diagonal  entry.  We  use  LAPACK  [7]  notation  for  the  scalar  quantities.1  We  present  Algorithm  1  in 
Matlab-style  notation  as  a  sequential  algorithm.  The  algorithm  works  column-by-column,  computing  a 
Householder  vector  and  then  updating  the  trailing  matrix  to  the  right.  The  Householder  vectors  are  stored 
in  an  mxb  lower  triangular  matrix  Y.  Note  that  we  do  not  include  r  as  part  of  the  output  because  it  can 
be  recomputed  from  Y . 

While  the  algorithm  works  for  general  m  and  n,  it  is  most  commonly  used  when  m  3>  n,  such  as  a 
panel  factorization  within  a  square  QR  decomposition.  In  LAPACK  terms,  this  algorithm  corresponds  to 
geqr2  and  is  used  as  a  subroutine  in  geqrf .  In  this  case,  we  also  compute  an  upper  triangular  matrix  T 
so  that 

n 

Q  =  U(i- mml)  =  i-  ytyt , 

2—1 

which  allows  the  application  of  QT  to  the  trailing  matrix  to  be  done  efficiently  using  matrix  multiplication. 
Computing  T  is  done  in  LAPACK  with  larf  t  but  can  also  be  computed  from  YTY  by  solving  the  equation 
Yty  =  T"1  +  T~t  for  T”1  (since  YTY  is  symmetric  and  T  1  is  triangular,  the  off-diagonal  entries  are 
equivalent  and  the  diagonal  entries  differ  by  a  factor  of  2)  [16]. 

3.1.1  ID  Algorithm 

We  will  make  use  of  Householder-QR  as  a  sequential  algorithm,  but  there  are  parallelizations  of  the  al¬ 
gorithm  in  libraries  such  as  ScaLAPACK  [8]  and  Elemental  [9]  against  which  we  will  compare  our  new 
approach.  Assuming  a  ID  distribution  across  p  processors,  the  parallelization  of  Householder-QR  (Algo¬ 
rithm  1)  requires  communication  at  lines  2  and  8,  both  of  which  can  be  performed  using  an  all-reduction. 

1  However,  we  depart  from  the  LAPACK  code  in  that  there  is  no  check  for  a  zero  norm  of  a  subcolumn. 
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Algorithm  1  [Y,  R]  =  Householder- QR  (A) 

Require:  A  is  to  x  6 
1:  for  i  =  1  to  6  do 

%  Compute  the  Householder  vector 
2:  a  =  0  =  \\A(i:m,i)\\2 

3:  if  a  >  0  then 

4:  /3  =  -f3 

5:  end  if 

6:  A(i,i)  =  P,r(i)  =  Pjf 

7:  A(i+l'.m,i)  =  ^73  •  A(i+l:m,i) 

%  Apply  the  Householder  transformation  to  the  trailing  matrix 
8:  z  =  r{i)  ■  [A(i,  i+l:b)  +  A(i+l:m,  i)T  ■  A(i+l\m,  *+1:6)] 

9:  A(i+l:m,  i+l-b)  =  A(i+l:m,i+l:b)  —  A(i+l:m,i)  ■  z 

10:  end  for 

Ensure:  A  =  (H'Lii1  ~  nyiyj))  R 

Ensure:  R  overwrites  the  upper  triangle  and  Y  (the  Householder  vectors)  has  implicit  unit  diagonal  and 
overwrites  the  strict  lower  triangle  of  A;  r  is  an  array  of  length  6  with  r?;  =  2 /{yjyf) 


Because  these  steps  occur  for  each  column  in  the  matrix,  the  total  latency  cost  of  the  algorithm  is  26  log  p. 
This  synchronization  cost  is  a  potential  parallel  scaling  bottleneck,  since  it  grows  with  the  number  of 
columns  of  the  matrix  and  does  not  decrease  with  the  number  of  processors.  The  bandwidth  cost  is  domi¬ 
nated  by  the  all-reduction  of  the  z  vector,  which  has  length  6  —  i  at  the  ith  step.  Thus,  the  bandwidth  cost 
of  Householder-QR  is  (62/2)logp.  The  computation  within  the  algorithm  is  load  balanced  for  a  parallel 
cost  of  (2m62  —  263/3)/p  flops. 


3.1.2  2D  Algorithm 


In  the  context  of  a  2D  algorithm,  in  order  to  perform  an  update  with  the  computed  Householder  vectors, 
we  must  also  compute  the  T  matrix  from  Y  in  parallel.  The  leading  order  cost  of  computing  T_1  from 
YtY  is  mb2 /p  flops  plus  the  cost  of  reducing  a  symmetric  6x6  matrix,  a  ■  logp  +  (3  ■  62/2;  note  that  the 
communication  costs  are  lower  order  terms  compared  to  computing  Y .  We  present  the  costs  of  parallel 
Householder-QR  in  the  first  row  of  Table  1,  combining  the  costs  of  Algorithm  1  with  those  of  computing 
T. 

We  refer  to  the  2D  algorithm  that  uses  Householder-QR  as  the  panel  factorization  as  2D-Householder- 
QR.  In  ScaLAPACK  terms,  this  algorithm  corresponds  to  pxgeqrf .  The  overall  cost  of  2D-Householder- 
QR,  which  includes  panel  factorizations  and  trailing  matrix  updates,  is  given  to  leading  order  by 


6mn6  —  3n26  n2b  2  mn2  —  2n3/3 


+/3- (  n61ogpr  + 


2  mn  —  n2  n 2 


+a- 1  2nlogpr  +  —  logpc 


The  derivation  of  these  costs  is  very  similar  to  that  of  CAQR-HR  (see  Section  4.3).  If  we  pick  pr  =  pc=  \fv 
(assuming  m  ~  n)  and  6  =  n/(y/plogp)  then  we  obtain  the  leading  order  costs 

7  •  (2 mn2  —  2n3 /3)/p  +  f3  ■  {mn  +  n2)/ yfp  +  a  ■  nlogp. 


Note  that  these  costs  match  those  of  [1,  8],  with  exceptions  coming  from  the  use  of  more  efficient  collectives. 
The  choice  of  6  is  made  to  preserve  the  leading  constants  of  the  parallel  computational  cost.  We  present 
the  costs  of  2D-Householder-QR  in  the  first  row  of  Table  2. 


3.2  Communication- Avoiding  QR 

In  this  section  we  present  parallel  Tall-Skinny  QR  (TSQR)  [1,  Algorithm  1]  and  Communication- Avoiding 
QR  (CAQR)  [1,  Algorithm  2],  which  are  algorithms  for  computing  a  QR  decomposition  that  are  more 
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communication  efficient  than  Householder- QR,  particularly  for  tall  and  skinny  matrices. 


3.2.1  ID  Algorithm  (TSQR) 

We  present  a  simplified  version  of  TSQR  in  Algorithm  2:  we  assume  the  number  of  processors  is  a  power 
of  two  and  use  a  binary  reduction  tree  (TSQR  can  be  performed  with  any  tree).  Note  also  that  we  present 
a  reduction  algorithm  rather  than  an  all-reduction  (he.,  the  final  R  resides  on  only  one  processor  at  the 
end  of  the  algorithm).  TSQR  assumes  the  tall-skinny  matrix  A  is  distributed  in  block  row  layout  so  that 
each  processor  owns  a  (m/p)  x  n  submatrix.  After  each  processor  computes  a  local  QR  factorization  of 
its  submatrix  (line  1),  the  algorithm  works  by  reducing  the  p  remaining  n  x  n  triangles  to  one  final  upper 
triangular  R  =  QT A  (lines  2-10).  The  Q  that  triangularizes  A  is  stored  implicitly  as  a  tree  of  sets  of 
Householder  vectors,  given  by  { Y) In  particular,  {Y^*,}  is  the  set  of  Householder  vectors  computed  by 
processor  i  at  the  kth  level  of  the  tree.  The  ith  leaf  of  tree,  Y);o  is  the  set  of  Householder  vectors  which 
processor  i  computes  by  doing  a  local  QR  on  its  part  of  the  initial  matrix  A. 

In  the  case  of  a  binary  tree,  every  internal  node  of  the  tree  consists  of  a  QR  factorization  of  two  stacked 
6x6  triangles  (line  6).  This  sparsity  structure  can  be  exploited,  saving  a  constant  factor  of  computation 
compared  to  a  QR  factorization  of  a  dense  26  x  6  matrix.  In  fact,  as  of  version  3.4,  LAPACK  has  subroutines 
for  exploiting  this  and  similar  sparsity  structures  (tpqrt).  Furthermore,  the  Householder  vectors  generated 
during  the  QR  factorization  of  stacked  triangles  have  similar  sparsity;  the  structure  of  the  Y*  &  for  k  >  0  is 
an  identity  matrix  stacked  on  top  of  a  triangle. 

Because  the  set  {Y),fc}  constitutes  the  implicit  tree  representation  of  the  orthogonal  factor,  it  is  impor¬ 
tant  to  note  how  the  tree  is  distributed  across  processors  since  the  Q  factor  will  be  either  applied  to  a 
matrix  (see  [1,  Algorithm  2])  or  constructed  explicitly  (see  Section  3.2.3).  For  a  given  Y)^,  i  indicates  the 
processor  number  (both  where  it  is  computed  and  where  it  is  stored),  and  k  indicates  the  level  in  the  tree. 
Although  0  <  i  <  p  —  1  and  0  <  k  <  logp,  many  of  the  Y)^  are  null;  for  example,  Yi,iogp  =  0  for  i  >  0  and 
Ip— i,fc 

for  0  <  k  <  logp  and  thus  requires  0(b2  logp)  extra  memory. 

The  costs  and  analysis  of  TSQR  are  given  in  [1,  17]: 


p-i}k  =  v  for  k  >  0.  In  the  case  of  the  binary  tree  in  Algorithm  2,  processor  0  computes  and  stores  Yq  ^. 


/  2m62  263  \  0  /  62 

7  '  I  — +  ~y  lo&P J  +  P  '  (  T  logp 


+  a  ■  logp. 


We  tabulate  these  costs  in  the  second  row  of  Table  1.  We  note  that  the  TSQR  inner  tree  factorizations 
require  an  extra  computational  cost  0(63logp)  and  a  bandwidth  cost  of  0(b2  logp).  Also  note  that  in  the 
context  of  a  2D  algorithm,  using  TSQR  as  the  panel  factorization  implies  that  there  is  no  6  x  6  T  matrix 
to  compute;  the  update  of  the  trailing  matrix  is  performed  differently. 


3.2.2  2D  Algorithm  (CAQR) 

The  2D  algorithm  that  uses  TSQR  for  panel  factorizations  is  known  as  CAQR.  In  order  to  update  the 
trailing  matrix  within  CAQR,  the  implicit  orthogonal  factor  computed  from  TSQR  needs  to  be  applied  as 
a  tree  in  the  same  order  as  it  was  computed.  See  [1,  Algorithm  2]  for  a  description  of  this  process,  or  see 
[18,  Algorithm  4]  for  pseudocode  that  matches  the  binary  tree  in  Algorithm  2.  We  refer  to  this  application 
of  implicit  QT  as  Apply-TSQR-Qr.  The  algorithm  has  the  same  tree  dependency  flow  structure  as  TSQR 
but  requires  a  bidirectional  exchange  between  paired  nodes  in  the  tree.  We  note  that  in  internal  nodes  of 
the  tree  it  is  possible  to  exploit  the  additional  sparsity  structure  of  Y)^  (an  identity  matrix  stacked  on  top 
of  a  triangular  matrix),  which  our  implementation  does  via  the  use  of  the  LAPACK  v3.4+  routine  tpmqrt. 

Further,  since  A  is  rn  x  n  and  intermediate  values  of  rows  of  A  are  communicated,  the  trailing  matrix 
update  costs  more  than  TSQR  when  n  >  6.  In  the  context  of  CAQR  on  a  square  matrix,  Apply-TSQR-Qr 
is  performed  on  a  trailing  matrix  with  n  ~  m  columns.  The  extra  work  in  the  application  of  the  inner 
leaves  of  the  tree  is  proportional  to  0(n2b\og(p)  /  y/p)  and  bandwidth  cost  proportional  to  0(n2  log (p)/ y/p). 
Since  the  cost  of  Apply-TSQR-QTis  almost  leading  order  in  CAQR,  it  is  desirable  in  practice  to  optimize 
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Algorithm  2  [{Yi^},R]  =  TSQR(A) 

Require:  Number  of  processors  is  p  and  i  is  the  processor  index 
Require:  A  is  rn  x  b  matrix  distributed  in  block  row  layout;  A,  is  processor  i’s  block 
1:  [Y-,0,  Ri]  =  Householder-QR(Aj) 

2:  for  k  =  1  to  [logp]  do 

3:  if  i  =  0  mod  2k  and  i  +  2k~1  <  p  then 

4:  j  =  i  +  2k~1 

5:  Receive  Rj  from  processor  j 

6:  [Yi,k,  R%]  =  Householder-QR  ^  ^ 

7:  else  if  i  =  2k  i  mod  2k  then 

8:  Send  R{  to  processor  i  —  2fc_1 

9:  end  if 

10:  end  for 
11:  if  i  =  0  then 
12:  R  =  R0 

13:  end  if 

Ensure:  A  =  QR  with  Q  implicitly  represented  by  {1^} 

Ensure:  R  is  stored  by  processor  0  and  Y%  f..  is  stored  by  processor  i 


the  update  routine.  However,  the  tree  dependency  structure  complicates  this  manual  developer  or  compiler 
optimization  task. 

The  overall  cost  of  CAQR  is  given  to  leading  order  by 


7  • 
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See  [1]  for  a  discussion  of  these  costs  and  [17]  for  detailed  analysis.  Note  that  the  bandwidth  cost  is  slightly 
lower  here  due  to  the  use  of  more  efficient  broadcasts.  If  we  pick  pr  =  pc  =  y/p  (assuming  m  «  n)  and 
b  =  ' 1  2  then  we  obtain  the  leading  order  costs 

v^log  p 


2mv?  —  2n3/3 


2 mn  +  n2  logp 

VP 


+  «•  (  2^log  p 


Again,  we  choose  b  to  preserve  the  leading  constants  of  the  computational  cost.  Note  that  b  needs  to  be 
chosen  smaller  here  than  in  Section  3.1.2  due  to  the  costs  associated  with  Apply-TSQR-QT. 

It  is  possible  to  reduce  the  costs  of  Apply-TSQR-Qr  further  using  ideas  from  efficient  recursive  dou¬ 
bling/halving  collectives.  See  Appendix  A  for  details. 


3.2.3  Constructing  Explicit  Q  from  TSQR 

In  many  use  cases  of  QR  decomposition,  an  explicit  orthogonal  factor  is  not  necessary;  rather,  we  need 
only  the  ability  to  apply  the  matrix  (or  its  transpose)  to  another  matrix,  as  done  in  the  previous  section. 
For  our  purposes  (see  Section  4),  we  will  need  to  form  the  explicit  m  X  b  orthonormal  matrix  from  the 
implicit  tree  representation.2  Though  it  is  not  necessary  within  CAQR,  we  describe  it  here  because  it  is  a 
known  algorithm  (see  [19,  Figure  4])  and  the  structure  of  the  algorithm  is  very  similar  to  TSQR. 

2In  LAPACK  terms,  constructing  ( i.e generating)  the  orthogonal  factor  when  it  is  stored  as  a  set  of  Householder  vectors 
is  done  with  orgqr. 
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Algorithm  3  [R]  =  Apply-TSQR-QT({Yjfc},  A) 

Require:  Number  of  processors  is  p  and  i  is  the  processor  index 

Require:  A  is  rn  x  n  matrix  distributed  in  block  row  layout;  Ai  is  processor  i’s  block 

Require:  is  the  implicit  tree  TSQR  representation  of  b  Householder  vectors  of  length  m. 

1:  Bi  =  Apply-Householder-(3T(y’j)o,  A,;) 

2:  Let  Bi  be  the  first  b  rows  of  B, 

3:  for  k  =  1  to  |~logp]  do 

4:  if  i  =  0  mod  2k  and  i  +  2k _1  <  p  then 

5:  j  =  i  +  2k_~ 1 

6:  Receive  Bj  from  processor  j 

7:  ^  =  Apply- Householder-^7  (y^,  ^  ^ 

8:  Send  Bj  back  to  processor  j 

9:  else  if  i  =  2fc_1  mod  2k  then 

10:  Send  Bi  to  processor  i  —  2fc_1 

11:  Receive  updated  rows  Bi  from  processor  i  —  2k~1 

12:  Set  the  first  b  rows  of  Bi  to  Bi 

13:  end  if 

14:  end  for 

Ensure:  B  =  QT A  with  processor  i  owning  block  Bi ,  where  Q  is  the  orthogonal  matrix  implicitly  repre¬ 
sented  by  {Yi)k] 

Algorithm  4  presents  the  method  for  constructing  the  mxb  matrix  Q  by  applying  the  (implicit)  square 
orthogonal  factor  to  the  first  b  columns  of  the  mxm  identity  matrix.  Note  that  while  we  present  Algorithm 
4  assuming  a  binary  tree,  any  tree  shape  is  possible,  as  long  as  the  implicit  Q  is  computed  using  the  same 
tree  shape  as  TSQR.  While  the  nodes  of  the  tree  are  computed  from  leaves  to  root,  they  will  be  applied  in 
reverse  order  from  root  to  leaves.  Note  that  in  order  to  minimize  the  computational  cost,  the  sparsity  of 
the  identity  matrix  at  the  root  node  and  the  sparsity  structure  of  { Y) at  the  inner  tree  nodes  is  exploited. 
In  particular,  since  If,  is  upper  triangular  and  T0  |-logp-|  has  the  structure  of  identity  stacked  on  top  of  an 
upper  triangle,  the  output  of  the  root  node  application  in  line  7  is  a  stack  of  two  upper  triangles.  By 
induction,  since  every  Y)  *.  for  k  >  0  has  the  same  structure,  all  output  Q,  ^  are  triangular  matrices.  At 
the  leaf  nodes,  when  17, o  is  applied  in  line  13,  the  output  is  a  dense  {m/p)  x  b  block. 

Since  the  communicated  matrices  Qj  are  triangular  just  as  Rj  was  triangular  in  the  TSQR  algorithm, 
Construct-TSQR-Q  incurs  the  exact  same  computational  and  communication  costs  as  TSQR.  So,  we  can 
reconstruct  the  unique  part  of  the  Q  matrix  from  the  implicit  form  given  by  TSQR  for  the  same  cost  as 
the  TSQR  itself. 

3.3  Yamamoto’s  Basis-Kernel  Representation 

The  main  goal  of  this  work  is  to  combine  Householder-QR  with  CAQR;  Yamamoto  [11]  proposes  a  scheme 
to  achieve  this.  As  described  in  Section  3.1,  2D-Householder-QR  suffers  from  a  communication  bottleneck 
in  the  panel  factorization.  TSQR  alleviates  that  bottleneck  but  requires  a  more  complicated  (and  slightly 
less  efficient)  trailing  matrix  update.  Motivated  in  part  to  improve  the  performance  and  programmability 
of  a  hybrid  CPU/GPU  implementation,  Yamamoto  suggests  computing  a  representation  of  the  orthogonal 
factor  that  triangularizes  the  panel  that  mimics  the  representation  in  Householder-QR. 

As  described  by  Sun  and  Bischof  [20],  there  are  many  so-called  “basis- kernel”  representations  of  an 
orthogonal  matrix.  See  also  [21]  for  general  discussion  of  block  reflectors.  For  example,  the  Householder- 
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Algorithm  4  Q  =  Construct-TSQR-QdY)^}) 

Require:  Number  of  processors  is  p  and  i  is  the  processor  index 
Require:  {Y[  *.}  is  computed  by  Algorithm  2  so  that  Yj  ^  is  stored  by  processor  i 
1:  if  %  =  0  then 
2:  Qo  =  lb 

3:  end  if 

4:  for  k  =  [logp]  down  to  1  do 
5:  if  i  =  0  mod  2k  and  i  +  2k~l  <  p  then 

6:  j  =  i  +  2k~ 1 

7:  =  Apply-Householder-Q  (  Y^k, 

8:  Send  Qj  to  processor  j 

9:  else  if  i  =  2k  i  mod  2k  then 

10:  Receive  Qt  from  processor  i  —  2k~l 

11:  end  if 

12:  end  for 

13:  Qi  =  Apply-Q-to- Triangle 
Ensure:  Q  is  orthonormal  m  x  b  matrix  distributed  in  block  row  layout;  Qi  is  processor  i’s  block 


QR  algorithm  computes  a  lower  triangular  matrix  Y  such  that  A  =  (I  —  YTY^)R,  so  that 


Q  =  I  -  YTYt  =  I  - 


T[Y 7  Y2t] 


Here,  Y  is  called  the  “basis”  and  T  is  called  the  “kernel”  in  this  representation  of  the  square  orthogonal 
factor  Q.  However,  there  are  many  such  basis-kernel  representations  if  we  do  not  restrict  Y  and  T  to  be 
lower  and  upper  triangular  matrices,  respectively. 

Yamamoto  [11]  chooses  a  basis- kernel  representation  that  is  easy  to  compute.  For  an  m  x  b  matrix  A, 

let  A  =  ^  R  where  Q\  and  R  are  6x6.  Then  define  the  basis-kernel  representation 
Q  2 


Q  =  I  -  YTYt  =  I  - 


Qi-I 

Q-2 


[i-Qi]  t[(Qi~i)t  QT[, 


~  R 

where  I  —  Q\  is  assumed  to  be  nonsingular.  It  can  be  easily  verified  that  QTQ  =  /  and  QT A  =  ;  in 

fact,  this  is  the  representation  suggested  and  validated  by  [22,  Theorem  3].  Note  that  both  the  basis  and 
kernel  matrices  Y  and  T  are  dense. 

The  main  advantage  of  basis-kernel  representations  is  that  they  can  be  used  to  apply  the  orthogonal 
factor  (or  its  transpose)  very  efficiently  using  matrix  multiplication.  In  particular,  the  computational 
complexity  of  applying  QT  using  any  basis-kernel  is  the  same  to  leading  order,  assuming  Y  has  the  same 
dimensions  as  A  and  m.  3>  6.  Thus,  it  is  not  necessary  to  reconstruct  the  Householder  vectors;  from  a 
computational  perspective,  finding  any  basis-kernel  representation  of  the  orthogonal  factor  computed  by 
TSQR  will  do.  Note  also  that  in  order  to  apply  QT  with  the  representation  in  Equation  (3),  we  need  to 
apply  the  inverse  of  I  —  Qi,  so  we  need  to  perform  an  LU  decomposition  of  the  6x6  matrix  and  then  apply 
the  inverses  of  the  triangular  factors  using  triangular  solves. 

The  assumption  that  I  —  Qi  is  nonsingular  can  be  dropped  by  replacing  I  with  a  diagonal  sign  matrix 
S  chosen  so  that  S  —  Q  i  is  nonsingular  [23];  in  this  case  the  representation  becomes 


QS  =  I-  YTYt  =  1- 


QiQ2S  5^-Qi]  T[(Qi-S)t  Ql 


Yamamoto’s  approach  is  very  closely  related  to  TSQR-HR  (Algorithm  6),  presented  in  Section  4.  We 
compare  the  methods  in  Section  4.1. 

4  New  Algorithms 

We  first  present  our  main  contribution,  a  parallel  algorithm  that  performs  TSQR  and  then  reconstructs  the 
Householder  vector  representation  from  the  TSQR  representation  of  the  orthogonal  factor.  We  then  show 
that  this  reconstruction  algorithm  may  be  used  as  a  building  block  for  more  efficient  2D  QR  algorithms.  In 
particular,  the  algorithm  is  able  to  combine  two  existing  approaches  for  2D  QR  factorizations,  leveraging 
the  efficiency  of  TSQR  in  panel  factorizations  and  the  efficiency  of  Householder-QR  in  trailing  matrix 
updates.  While  Householder  reconstruction  adds  some  extra  cost  to  the  panel  factorization,  we  show  that 
its  use  in  the  2D  algorithm  reduces  overall  communication  compared  to  both  2D-Householder-QR  and 
CAQR. 

4.1  TSQR  with  Householder  Reconstruction 

The  basic  steps  of  our  ID  algorithm  include  performing  TSQR,  constructing  the  explicit  tall-skinny  Q 
factor,  and  then  computing  the  Householder  vectors  corresponding  to  Q.  The  key  idea  of  our  reconstruction 
algorithm  is  that  performing  Householder-QR  on  an  orthonormal  matrix  Q  is  the  same  as  performing  an 
LU  decomposition  on  Q  —  5,  where  S  is  a  diagonal  sign  matrix  corresponding  to  the  sign  choices  made 
inside  the  Householder-QR  algorithm.  This  claim  is  proved  explicitly  in  Lemma  5.2.  Informally,  ignoring 
signs,  if  Q  =  I  —  YTY ^  with  Y  a  matrix  of  Householder  vectors,  then  Y  •  (— TY (Q  is  an  LU  decomposition 
of  Q  —  I  since  Y  is  unit  lower  triangular  and  TY^  is  upper  triangular. 

In  this  section  we  present  Modified-LU  as  Algorithm  5,  which  can  be  applied  to  any  orthonormal  matrix 
(not  necessarily  one  obtained  from  TSQR).  Ignoring  lines  1,  3,  and  4,  it  is  exactly  LU  decomposition  without 
pivoting.  Note  that  with  the  choice  of  5,  no  pivoting  is  required  since  the  effective  diagonal  entry  will  be  at 
least  1  in  absolute  value  and  all  other  entries  in  the  column  are  bounded  by  1  (the  matrix  is  orthonormal).3 
This  holds  true  throughout  the  entire  factorization  because  the  trailing  matrix  remains  orthonormal,  which 
we  prove  within  Lemma  5.2. 

Algorithm  5  [L,U,S]  =  Modified-LU (Q) 

Require:  Q  is  m  x  b  orthonormal  matrix 
1:  5  =  0 

2:  for  i  =  1  to  b  do 
3:  S(i,i)  =  -sgn{Q(i,i)) 

4:  Q(i,i)  =  Q(i,i)  ~  S(i,i) 

%  Scale  ith  column  of  L  by  diagonal  element 
5:  Q(i+l:m,  i)  =  ■  Q(i+l:m,  i ) 

%  Perform  Schur  complement  update 
6:  Q(i+l:m,i+l:b)  =  Q(i+l:m,i+l:b)—Q(i+l:m,i)-Q(i,i+l:b) 

7:  end  for 

Ensure:  U  overwrites  the  upper  triangle  and  L  has  implicit  unit  diagonal  and  overwrites  the  strict  lower 
triangle  of  Q]  S  is  diagonal  so  that  Q  —  5  =  LU 


Given  the  algorithms  of  the  previous  sections,  we  now  present  the  full  approach  for  computing  the 
QR  decomposition  of  a  tall-skinny  matrix  using  TSQR  and  Householder  reconstruction.  That  is,  in  this 
section  we  present  an  algorithm  such  that  the  format  of  the  output  of  the  algorithm  is  identical  to  that 
of  Householder-QR.  However,  we  argue  that  the  communication  costs  of  this  approach  are  much  less  than 
those  of  performing  Householder-QR. 

JWe  use  the  convention  sgn(0)  =  1. 
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Flops  Words  Messages 

Householder-QR 

TSQR 

TSQR-HR 

|log  p  26  log  p 

~pT“  +  TT  log  P  floS  P  P 

^  +  ^logp  62logp  41ogp 

Table  1:  Costs  of  QR  factorization  of  tall-skinny  m  x  6  matrix  distributed  over  p  processors  in  ID  fashion. 
We  assume  these  algorithms  are  used  as  panel  factorizations  in  the  context  of  a  2D  algorithm  applied  to 
an  m  x  n  matrix.  Thus,  costs  of  Householder-QR  and  TSQR-HR  include  the  costs  of  computing  T. 

The  method,  given  as  Algorithm  6,  is  to  perform  TSQR  (line  1),  construct  the  tall-skinny  Q  factor 
explicitly  (line  2),  and  then  compute  the  Householder  vectors  that  represent  that  orthogonal  factor  using 
Modified-LU  (line  3).  The  R  factor  is  computed  in  line  1  and  the  Householder  vectors  (the  columns  of  Y) 
are  computed  in  line  3.  An  added  benefit  of  the  approach  is  that  the  triangular  T  matrix,  which  allows  for 
block  application  of  the  Householder  vectors,  can  be  computed  very  cheaply.  That  is,  a  triangular  solve 
involving  the  upper  triangular  factor  from  Modified-LU  computes  the  T  so  that  A  =  (I  —  YTY^)R.  To 
compute  T  directly  from  Y  (as  is  necessary  if  Householder-QR  is  used)  requires  0(nb2)  flops;  here  the 
triangular  solve  involves  0{b 3)  flops.  Our  approach  for  computing  T  is  given  in  line  4,  and  line  5  ensures 
sign  agreement  between  the  columns  of  the  (implicitly  stored)  orthogonal  factor  and  rows  of  R. 


Algorithm  6  [Y,  T,  R]  =  TSQR-HR(A) 

Require:  A  is  m  x  b  matrix  distributed  in  block  row  layout 
1:  [{Yi!k},R}  =TSQR(A) 

2:  Q  =  C  ons  t  met- T  S  QR  -  Q({Ylk}) 

3:  [Y,  U,  S }  =  Modified-LU(Q) 

4:  T  =  -USY{t 
5:  R  =  SR 

Ensure:  A  =  (I  —  YTY^)R.  where  Y  is  m  x  b  and  unit  lower  triangular,  Y\  is  top  6x6  block  of  Y,  and 
T  and  R  are  6x6  and  upper  triangular 


On  p  processors,  Algorithm  6  incurs  the  following  costs  (ignoring  lower  order  terms): 

1.  Compute  [{Yitk},R']  =  TSQR(A) 

The  computational  costs  of  this  step  come  from  lines  1  and  6  in  Algorithm  2.  Line  1  corresponds 
to  a  QR  factorization  of  a  (m/p)  x  6  matrix,  with  a  flop  count  of  2(m/p)b2  —  263  /3  (each  processor 
performs  this  step  simultaneously).  Line  6  corresponds  to  a  QR  factorization  of  a  6  x  6  triangle 
stacked  on  top  of  a  6  x  6  triangle.  Exploiting  the  sparsity  structure,  the  flop  count  is  263/3;  this 
occurs  at  every  internal  node  of  the  binary  tree,  so  the  total  cost  in  parallel  is  (263/3)  logp. 

The  communication  costs  of  Algorithm  2  occur  in  lines  5  and  8.  Since  every  R,  k  is  a  6  x  6  upper 
triangular  matrix,  the  cost  of  a  single  message  is  a  +  (3  ■  (62/2).  This  occurs  at  every  internal  node 
in  the  tree,  so  the  total  communication  cost  in  parallel  is  a  factor  logp  larger. 

Thus,  the  cost  of  this  step  is 

( 2m62  263  \  /  62  \ 

7  '  I  — — h  ~y  l°£P  )  +  P  •  (  —  logp  I  +  a  ■  log p. 

2.  Q  =  Construct-TSQR-Q ( { Ytj,. } ) 

The  computational  costs  of  this  step  come  from  lines  7  and  13  in  Algorithm  4.  Note  that  for  k  >  0, 
is  a  26  x  6  matrix:  the  identity  matrix  stacked  on  top  of  an  upper  triangular  matrix.  Furthermore, 
Qitk  is  an  upper  triangular  matrix.  Exploiting  the  structure  of  and  Qitk,  the  cost  of  line  7  is 
263/3,  which  occurs  at  every  internal  node  of  the  tree.  Each  Y, o  is  a  (m/p)  x  6  lower  triangular 
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matrix  of  Householder  vectors,  so  the  cost  of  applying  them  to  an  upper  triangular  matrix  in  line  13 
is  2 (m/p)b2  —  263/3.  Note  that  the  computational  cost  of  these  two  lines  is  the  same  as  those  of  the 
previous  step  in  Algorithm  2. 

The  communication  pattern  of  Algorithm  4  is  identical  to  Algorithm  2,  so  the  communication  cost 
is  also  the  same  as  that  of  the  previous  step. 

Thus,  the  cost  of  constructing  Q  is 

( 2  mb2  2  b3  \  /& 2  \ 

7  •  I  — — I-  ~Y  loSP  )  +  ^  '  l  ~2  l°SP )  +  a  '  l°gP' 


3.  [Y,  U,  S }  =  Modihed-LU(Q) 

Ignoring  lines  3-4  in  Algorithm  5,  Modihed-LU  is  the  same  as  LU  without  pivoting.  In  parallel,  the 
algorithm  consists  of  a  b  x  b  (modified)  LU  factorization  of  the  top  block  followed  by  parallel  triangular 
solves  to  compute  the  rest  of  the  lower  triangular  factor.  The  flop  count  of  the  bxb  LU  factorization 
is  263/3,  and  the  cost  of  each  processor’s  triangular  solve  is  ( m/p)b 2.  The  communication  cost  of 
parallel  Modihed-LU  is  only  that  of  a  broadcast  of  the  upper  triangular  b  x  b  U  factor  (for  which  we 
use  a  bidirectional-exchange  algorithm):  f3  ■  ( b 2)  +  a  ■  (2 log p). 

Thus,  the  cost  of  this  step  is 


7  ' 


+  /3  ■  b2  +  a  ■  2  logp 


4.  T  =  -USY/~T  and  R  =  SR' 

The  last  two  steps  consist  of  computing  T  and  scaling  the  final  R  appropriately.  Since  S  is  a  sign 
matrix,  computing  US  and  SR'  requires  no  floating  point  operations  and  can  be  done  locally  on  one 
processor.  Thus,  the  only  computational  cost  is  performing  a  bxb  triangular  solve  involving  . 
If  we  ignore  the  triangular  structure  of  the  output,  the  hop  count  of  this  operation  is  b3.  However, 
since  this  operation  occurs  on  the  same  processor  that  computes  the  top  m/p  x  b  block  of  Y  and 
U,  it  can  be  overlapped  with  the  previous  step  (Modihed-LU).  After  the  top  processor  performs  the 
bxb  LU  factorization  and  broadcasts  the  U  factor,  it  computes  only  m/p  —  b  rows  of  Y  (all  other 
processors  update  m/p  rows).  Thus,  performing  an  extra  bxb  triangular  solve  on  the  top  processor 
adds  no  computational  cost  to  the  critical  path  of  the  algorithm. 


Thus,  TSQR-HR(A)  where  A  is  m-by-b  incurs  the  following  costs  (ignoring  lower  order  terms): 


7- 


+  /3  •  (b2  log p)  +  a  ■  (4  logp) . 


(5) 


Note  that  the  LU  factorization  required  in  Yamamoto’s  approach  (see  Section  3.3)  is  equivalent  to 
performing  Modihed-LU(— Qi).  In  Algorithm  6,  the  Modihed-LU  algorithm  is  applied  to  an  mxb  matrix 
rather  than  to  only  the  top  bxb  block;  since  no  pivoting  is  required,  the  only  difference  is  the  update 
of  the  bottom  m  —  b  rows  with  a  triangular  solve.  Thus  it  is  not  hard  to  see  that,  ignoring  signs,  the 
Householder  basis-kernel  representation  in  Equation  (2)  can  be  obtained  from  the  representation  given  in 
Equation  (3)  with  two  triangular  solves:  if  the  LU  factorization  gives  I  —  Q i  =  LU,  then  Y  =  YU~2  and 
T  =  UL~T.  Indeed,  performing  these  two  operations  and  handling  the  signs  correctly  gives  Algorithm 
6.  While  Yamamoto’s  approach  avoids  performing  the  triangular  solve  on  Q 2,  it  still  involves  performing 
both  TSQR  and  Construct-TSQR-Q. 
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Flops  Words  Messages 

2D-Householder-QR 

CAQR 

CAQR-HR 

2mnz—2n6/Z  2mn-\-nz/2  i 

P  Yv  nlogp 

2mn2-2n3/3  2mn+n2  logp  7  /=in„3 

P  y/p  2V^iU&  V 

2mn2— 2n3/3  2mn+n2/2  r  /—  1  2 

P  Yp  6^log  p 

Table  2:  Costs  of  QR  factorization  of  m  x  n  matrix  distributed  over  p  processors  in  2D  fashion.  Here  we 
assume  a  square  processor  grid  (pr  =  pc) .  We  also  choose  block  sizes  for  each  algorithm  independently  to 
ensure  the  leading  order  terms  for  flops  are  identical. 


4.2  LU  Decomposition  of  A  —  R 


Computing  the  Householder  vectors  basis-kernel  representation  via  Algorithm  6  requires  forming  the  ex¬ 
plicit  m  x  b  orthonormal  matrix.  It  is  possible  to  compute  this  representation  more  cheaply,  though  at 

M 


the  expense  of  losing  numerical  stability.  Suppose  A  is  full  rank,  R  = 


0 


is  the  upper  triangular  factor 


computed  by  TSQR,  and  A  —  R  is  also  full  rank.  Then  A  =  (/  —  YTY^)R.\  implies  that  the  unique  LU 
decomposition  of  A  —  R  is  given  by 


A-  R  =  Y(-TY?R1), 


(6) 


since  Y  is  unit  lower  triangular  and  T,  Yf  ,  and  Ri  are  all  upper  triangular.  The  assumption  that  A  —  R 
is  full  rank  can  be  dropped  by  choosing  an  appropriate  diagonal  sign  matrix  S  and  computing  the  LU 
\SR1] 


decomposition  of  A  — 


0 


Not  surprisingly,  choosing  the  signs  to  match  the  choices  of  the  Householder-QR  algorithm  applied  to 
A  guarantees  that  the  matrix  is  nonsingular.  In  fact,  the  signs  can  be  computed  during  the  course  of  a 
modified  LU  decomposition  algorithm  very  similar  to  Algorithm  5  (designed  for  general  matrices  rather 
than  only  orthonormal).  This  more  general  algorithm  can  also  be  interpreted  as  performing  Householder- 
QR  on  A  but  using  R  to  provide  “hints”  to  the  algorithm  to  avoid  certain  expensive  computations;  see 
Appendix  B  for  more  details. 

The  advantage  of  this  approach  is  that  Y  can  be  computed  without  constructing  Q  explicitly,  which 
is  as  expensive  as  TSQR  itself.  Unfortunately,  this  method  is  not  numerically  stable.  An  intuition  for 
the  instability  comes  from  considering  a  low-rank  matrix  A.  Because  the  QR  decomposition  of  A  is 
not  unique,  the  TSQR  algorithm  and  the  Householder-QR  algorithm  may  compute  different  factor  pairs 
(QtsqRi  -Rtsqr)  and  (Qmu-Rffli)-  Performing  an  LU  decomposition  using  A  and  Rtsqr  to  recover  the 
Householder  vectors  that  represent  Qnh  is  hopeless  if  Rtsqr  /  -Rfflu  even  in  exact  arithmetic.  In  floating 
point  arithmetic,  an  ill-conditioned  A  corresponds  to  a  factor  R  which  is  sensitive  to  roundoff  error,  so 
reconstructing  the  Householder  vectors  that  would  produce  the  same  R  as  computed  by  TSQR  is  unstable. 
However,  if  A  is  well-conditioned,  then  this  method  is  sufficient;  one  can  view  Algorithm  5  as  applying  the 
method  to  an  orthonormal  matrix  (which  is  perfectly  conditioned). 

One  method  of  decreasing  the  sensitivity  of  R  is  to  employ  column  pivoting.  That  is,  after  performing 
TSQR  apply  QR  with  column  pivoting  to  the  b  x  b  upper  triangular  factor  R,  yielding  a  factorization 
A  =  Qtsqr{QRP),  or  AP  =  (QtsqrQ)R,  where  the  diagonal  entries  of  R  decrease  monotonically  in 


absolute  value.  We  have  observed  experimentally  that  performing  LU  decomposition  of  AP  — 


is  a 


more  stable  operation  than  not  using  column  pivoting,  though  it  is  still  not  as  robust  as  Algorit 
Section  5  for  a  discussion  of  experiments  where  the  instability  occurs  (even  with  column  pivoting). 


SR 
0 

un  6.  See 


4.3  CAQR-HR 

We  refer  to  the  2D  algorithm  that  uses  TSQR-HR  for  panel  factorizations  as  CAQR-HR.  Because  Householder- 
QR  and  TSQR-HR  generate  the  same  representation  as  output  of  the  panel  factorization,  the  trailing  matrix 


12 


Algorithm  7  [Y,  T,  R]  =  CAQR-HR(A) 

Require:  Aismxrj  and  distributed  block-cyclically  on  p  =  pr  ■  pc  processors  with  block  size  b,  so  that 
each  b  x  b  block  At]  is  owned  by  processor  Il(i,  j)  =  (i  modpr)  +  pr  ■  (j  mod  pc) 

X:  for  i  =  0  to  n/b  —  1  do 

%  Compute  TSQR  and  reconstruct  Householder  representation  using  column  of  pr  processors 
2:  [Yi-.m/b-i^T^Ra]  =  Hh-Recon-TSQR(Aj.m/6_M) 

%  Update  trailing  matrix  using  all  p  processors 
3:  II (i,i)  broadcasts  Tt  to  all  other  processors 

4:  for  r  6  [i,  m/b  —  1],  c  G  [i  +  1  ,n/b  —  1]  do  in  parallel 

5:  n(r,  i)  broadcasts  Yri  to  all  processors  in  its  row,  :) 

6:  n(r,  c)  computes  Wrc  =  Y?  ■  Arc 

7:  Allreduce  Wc  =  Y/,r  Wrc  so  that  processor  column  II(:,c)  owns  Wc 

8:  n(r,  c)  computes  Arc  =  Arc  —  Yri  ■  Tj  ■  Wc 

9:  Set  Ric  =  Aic 

10:  end  for 

11:  end  for 

Ensure:  A  =  -  Y^Y^  R 

update  can  be  performed  in  exactly  the  same  way.  Thus,  the  only  difference  between  2D-Householder-QR 
and  CAQR-HR,  presented  in  Algorithm  7,  is  the  subroutine  call  for  the  panel  factorization  (line  2). 

Algorithm  7  with  block  size  b,  and  matrix  ?n-by-n  matrix  A,  with  m  >  n  and  m,n  =  0  (mod  b) 
distributed  on  a  2D  pr-by-pc  processor  grid  incurs  the  following  costs  over  all  n/b  iterations., 

1.  Compute  [Yi:m/b_i:i,Ti,  Ra\  =  Hh-Recon-TSQR  (Ai:rn/b_lti) 

Equation  (5)  gives  the  cost  of  a  single  panel  TSQR  factorization  with  Householder  reconstruction. 
We  can  sum  over  all  iterations  to  obtain  the  cost  of  this  step  in  the  2D  QR  algorithm  (line  2  in 
Algorithm  7), 

Xj  ^7  •  lb^b  +  wj-  log  Pr^j  +  P  ■  {b2  log  Pr )  +  a  ■  (41ogpr)^  = 

( 5 mnb  5 n2b  Anb2  ,  \  „  .  ,  ,  .  (An log pr \ 

~f—-  —  +  —logPrj+l3.(nMogPr)  +  af-^) 

2.  n(i,  i)  broadcasts  Ti  to  all  other  processors 

The  matrix  T*  is  b-by-b  and  triangular,  so  we  use  a  bidirectional  exchange  broadcast.  Since  there  are 
a  total  of  n/b  iterations,  the  total  communication  cost  for  this  step  of  Algorithm  7  is 

/3  ■  (nb)  +  a  ■  logp 

3.  n(r,  i)  broadcasts  Yri  to  all  processors  in  its  row,  n(r, :) 

At  this  step,  each  processor  which  took  part  in  the  TSQR  and  Householder  reconstruction  sends 
its  local  chunk  of  the  panel  of  Householder  vectors  to  the  rest  of  the  processors.  At  iteration  i  of 
Algorithm  7,  each  processor  owns  at  most  blocks  Yr{.  Since  all  the  broadcasts  happen  along 

processor  rows,  we  assume  they  can  be  done  simultaneously  on  the  network.  The  communication 
along  the  critical  path  is  then  given  by 
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n/b—1 

£ 


i=0 


+  a  ■  2  log  p 


P- 


(  2mn  —  ?r2 ' 

I  -  I  +  ol 

\  Pr 


A.  II(r,  c)  computes  Wrc  =  Y?  ■  Arc 

Each  block  Wrc  is  computed  on  processor  n(r,  c),  using  Arc,  which  it  owns,  and  Yri ,  which  it  just 
received  from  processor  At  iteration  i,  processor  j  may  own  up  to  \m/r  •  [~  ']  blocks 

Wrc,  n(r,  c)  =  j.  Each,  block-by-block  multiply  incurs  263  flops,  therefore,  this  step  incurs  a  total 
computational  cost  of 


7 


'n/b—1 

£ 

i=0 


2(m/b  —  i){n/b  —  i)b3 


V 


=  7 


mn 2  —  n3/  3 


5.  Allreduce  Wc  =  Wrc  so  that  processor  column  n(:,c)  owns  Wc 

At  iteration  i  of  Algorithm  7,  each  processor  j  may  own  up  to  ■  \n^~'l~\  blocks  Wrc.  The 

local  part  of  the  reduction  can  be  done  during  the  computation  of  Wrc  on  line  6  of  Algorithm  7. 
Therefore,  no  process  should  contribute  more  than  [n/^~^]  6-by-6  blocks  Wrc  to  the  reduction.  Using 
a  bidirectional  exchange  all-reduction  algorithm  and  summing  over  the  iterations,  we  can  obtain  the 
cost  of  this  step  throughout  the  entire  execution  of  the  2D  algorithm: 


n/b—1 

£ 


i=0 


+  a  ■  2  logp 


6.  II(r,  c)  computes  Arc  =  Arc  —  Yri  ■  T%  ■  Wc 

Since  in  our  case,  m>  n,  it  is  faster  to  first  multiply  T;L  by  Wc  rather  than  Yri  by  Tj.  Any  processor 
j  may  update  up  to  [ ■  \  blocks  of  Arc  using  [ blocks  of  Yri  and  [  blocks  of 

Wc.  When  multiplying  T;  by  Wc,  we  can  exploit  the  triangular  structure  of  T,  to  lower  the  flop  count 
by  a  factor  of  two.  Summed  over  all  iterations,  these  two  multiplications  incur  a  computational  cost 
of 


|  2 (m/b  —  i)(n/b  —  i)b 3  ( n/b  —  i)b 3 

^  l  p  p 

i= o 


=  7‘ 


mn2  —  n3/2>  n2b 
P  +  2  pc 


The  overall  costs  of  CAQR-HR  are  given  to  leading  order  by 

7,( 


/ lOmnb  —  5n2b  Anb2  .  n2b  2mn2  —  2n3/3' 

— - +  iogPr+  + - _ — n,+ 

Zj^r  ^ Pc  P 


(3  •  (  nblogpr  + 


2 mn  —  n2  n2  \ 


Pr 


8  n 


An 


H - +  a  •  {  —  log  pr  +  —  log  p, 

Pc  J 


If 


we  pick  pr  =  pc  =  p  (assuming  m  «  n)  and  b  =  then  we  obtain  the  leading  order  costs 


7- 


2  mn2  — 


2n3/3\ 


P 


+  P 


2  mn  +  n2  /  2 

Vp 


+  a  ■  (6^  log2  p)  , 


shown  in  the  third  row  of  Table  2. 
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Comparing  the  leading  order  costs  of  CAQR-HR  with  the  existing  approaches,  we  note  again  the 
0{n  log p)  latency  cost  incurred  by  the  2D-Householder-QR  algorithm.  CAQR  and  CAQR-HR  eliminate 
this  synchronization  bottleneck  and  reduce  the  latency  cost  to  be  independent  of  the  number  of  columns 
of  the  matrix.  Further,  both  the  bandwidth  and  latency  costs  of  CAQR-HR  are  factors  of  O(logp)  lower 
than  CAQR  (when  m  «  n).  As  previously  discussed,  CAQR  includes  an  extra  leading  order  bandwidth 
cost  term  (/3  •  n2  log p/ A/p),  as  well  as  a  computational  cost  term  (7 •  ( n2b/pc )  logpr)  that  requires  the  choice 
of  a  smaller  block  size  and  leads  to  an  increase  in  the  latency  cost. 

Further,  the  Householder  form  allows  us  to  decouple  the  block  sizes  used  for  the  trailing  matrix  update 
from  the  width  of  each  TSQR.  This  is  a  simple  algorithmic  optimization  to  add  to  our  2D  algorithm,  which 
has  the  same  leading  order  costs.  We  show  this  nested  blocked  approach  in  Algorithm  8,  which  is  very 
similar  to  Algorithm  7,  and  does  not  require  any  additional  subroutines.  While  this  algorithm  does  not 
have  a  theoretical  benefit,  it  is  highly  beneficial  in  practice,  as  we  will  demonstrate  in  the  performance 
section.  We  note  that  it  is  not  possible  to  aggregate  the  implicit  TSQR  updates  in  this  way. 

Algorithm  8  [Y,  T,  R\  =  CAQR-HR- Agg(A) 

Require:  A  is  m  x  n  and  distributed  block-cyclically  on  p  =  pr  ■  pc  processors  with  block  size  61,  so  that 
each  bi  x  b\  block  Aij  is  owned  by  processor  n (i,j)  =  (i  mod  pr)  +  pr  ■  ( j  mod  pc) 

1:  for  i  =  0  to  n/62  —  1  do 

2:  \Xi-fn/b2— l,i j  Rii\  =  CAQR-HR(Aj.m^2_^  j) 

3:  Aggregate  updates  from  line  5  of  CAQR-HR  into  (m  —  i^-by-fe  matrix  Yri 

4:  for  r  £  [i.  m/62  —  1],  c  £  \i  +  1,  n/62  —  1]  do  in  parallel 

5:  n(r,  c)  computes  Wrc  =  Y?  ■  Arc 

6:  Allreduce  Wc  =  Ylr  ^rc  so  that  processor  column  n(:,c)  owns  Wc 

7:  n(r,  c)  computes  Arc  =  Arc  —  Yri  ■  Tj  •  Wc 

8:  Set  RiC  =  AiC 

9:  end  for 

10:  end  for 

Ensure:  A  =  (ri'Ua  -  R 


5  Correctness  and  Stability 


In  order  to  reconstruct  the  lower  trapezoidal  Householder  vectors  that  constitute  an  orthonormal  matrix  (up 
to  column  signs),  we  can  use  an  algorithm  for  LU  decomposition  without  pivoting  on  an  associated  matrix 
(Algorithm  5).  Lemma  5.1  shows  by  uniqueness  that  this  LU  decomposition  produces  the  Householder 
vectors  representing  the  orthogonal  matrix  in  exact  arithmetic.  Given  the  explicit  orthogonal  factor,  the 
LU  method  is  cheaper  in  terms  of  both  computation  and  communication  than  constructing  the  vectors  via 
Householder-  QR . 

Lemma  5.1.  Given  an  orthonormal  mxb  matrix  Q,  let  the  compact  QR  decomposition  of  Q  given  by  the 
Householder-QR  algorithm  (Algorithm  1)  be 


Q 


-YTYf  \  s, 


where  Y  is  unit  lower  triangular,  Y\  is  the  top  b  x  b  block  ofY,  and  T  is  the  upper  triangular  b  x  b  matrix 

[5] 

satisfying  t^  +  T1  =  YTY .  Then  S  is  a  diagonal  sign  matrix,  and  Q  — 


given  by 


0 


Q  - 


=  Y-(-TY(rS). 


has  a  unique  L  U  decomposition 

(7) 
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Proof.  Since  Q  is  orthonormal,  it  has  full  rank  and  therefore  has  a  unique  QR  decomposition  with  positive 
diagonal  entries  in  the  upper  triangular  matrix.  This  decomposition  is  Q  =  Q  ■  In,  so  the  Householder-QR 
algorithm  must  compute  a  decomposition  that  differs  only  by  sign.  Thus,  S'  is  a  diagonal  sign  matrix.  The 
uniqueness  of  T  is  guaranteed  by  [20,  Example  2.4], 

We  obtain  Equation  (7)  by  rearranging  the  Householder  QR  decomposition.  Since  Y  is  unit  lower 
triangular  and  T,  ,  and  S  are  all  upper  triangular,  we  have  an  LU  decomposition.  Since  Y  is  full 

rg\ 

column  rank  and  T,  ,  and  S  are  all  nonsingular,  Q  —  has  full  column  rank  and  therefore  has  a 
unique  LU  decomposition  with  a  unit  lower  triangular  factor.  □ 

Unfortunately,  given  an  orthonormal  matrix  Q,  the  sign  matrix  S  produced  by  Householder-QR  is  not 

Is] 

known,  so  we  cannot  run  a  standard  LU  algorithm  on  Q  —  .  Lemma  5.2  shows  that  by  running  the 

Modified-LU  algorithm  (Algorithm  5),  we  can  cheaply  compute  the  sign  matrix  S  during  the  course  of  the 
algorithm. 

Lemma  5.2.  In  exact  arithmetic,  Algorithm  5  applied  to  an  orthonormal  m  x  b  matrix  Q  computes  the 
same  Householder  vectors  Y  and  sign  matrix  S  as  the  Householder-QR  algorithm  (Algorithm  1)  applied  to 

Q. 

Proof.  Consider  applying  one  step  of  Modified-LU  (Algorithm  5)  to  the  orthonormal  matrix  Q,  where  we 
first  set  Su  =  —  sgn(gn)  and  subtract  it  from  gn.  Note  that  since  all  other  entries  of  the  first  column 
are  less  than  1  in  absolute  value  and  the  absolute  value  of  the  diagonal  entry  has  been  increased  by  1, 

the  diagonal  entry  is  the  maximum  entry.  Following  the  LU  algorithm,  all  entries  below  the  diagonal  are 

scaled  by  the  reciprocal  of  the  diagonal  element:  for  2  <  i  <  m, 

Qil  /c\ 

Vi  i  =  - : - 7 - r,  (8) 

gn  +  sgn(gn) 

where  Y  is  the  computed  lower  triangular  factor.  The  Schur  complement  is  updated  as  follows:  for 
2  <  i  <  m  and  2  <  j  <  b, 

«  =  (9) 

Now  consider  applying  one  step  of  the  Householder-QR  algorithm  (Algorithm  1).  To  match  the  LA- 
PACK  notation,  we  let  a  =  gn,  f3  =  —  sgn(cc)  •  ||gi||  =  —  sgn(a),  and  r  =  =  1  +  asgn(a),  where  we 

use  the  fact  that  the  columns  of  Q  have  unit  norm.  Note  that  (3  =  —  sgn(a)  =  Sn,  which  matches  the 
Modified-LU  algorithm  above.  The  Householder  vector  y  is  computed  by  setting  the  diagonal  entry  to  1 
and  scaling  the  other  entries  of  q\  by  =  Q+Sgn(a)  •  Thus,  computing  the  Householder  vector  matches 
computing  the  column  of  the  lower  triangular  matrix  from  Modified-LU  in  Equation  (8)  above. 

Next,  we  consider  the  update  of  the  trailing  matrix:  Q  =  (/  —  ryyT)Q.  Since  Q  has  orthonormal 
columns,  the  dot  product  of  the  Householder  vector  with  the  jth  column  is  given  by 


VT1,  =  E  ViHi  =  w  +  E  a  +  9“n(a)w 


QuQij 
a  +  sgn(cc) 


=  1- 


a  +  sgn(cy 


The  identity 


(1  +  asgn(a))  (  1  - 


a  +  sgn(a) 


implies  ryTqj  =  q\j.  Then  the  trailing  matrix  update  (2  <i<m  and  2  <  j  <  b)  is  given  by 


Qij  =  Qij  ~  yi{ryTqj)  =  qij  - 


qnqij 

a  +  sgn(a)  ’ 
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which  matches  the  Schur  complement  update  from  Modified-LU  in  Equation  (9)  above. 

Finally,  consider  the  jth  element  of  first  row  of  the  trailing  matrix  (j  >  2):  qtj  =  qtj  —  yi{ryT  qfl  =  0. 
Note  that  the  computation  of  the  first  row  is  not  performed  in  the  Modified-LU  algorithm.  The  correspond¬ 
ing  row  is  not  changed  since  the  diagonal  element  is  Y\  \  =  1.  However,  in  the  case  of  Householder-QR, 
since  the  first  row  of  the  trailing  matrix  is  zero,  and  because  the  Householder  transformation  preserves 
column  norms,  the  updated  (m  —  1)  X  (b  —  1)  trailing  matrix  is  itself  an  orthonormal  matrix.  Thus,  by 
induction,  the  rest  of  the  two  algorithms  perform  the  same  computations.  □ 

In  order  to  bound  the  error  in  the  decomposition  of  a  general  matrix  using  Householder  reconstruction, 
we  will  use  Lemma  5.4  below  as  well  as  the  stability  bounds  on  the  TSQR  or  “AllReduce”  algorithm 
provided  in  [24] .  We  restate  those  results  here. 

Lemma  5.3  ([24]).  Let  R  be  the  computed  upper  triangular  factor  of  m  x  b  matrix  A  obtained  via  the 
AllReduce  algorithm  using  a  binary  tree  of  height  L  (m/2L  >b).  Then  there  exists  an  orthonormal  matrix 
Q  such  that 

II A  -  QR\\f  <  fi(m,b,L,£)\\A\\F 

and 

II Q  ~  Q\\f  <  /2(m,6,L,e), 

where  fi(m,  b,  L,e)  ~  +  L&7e-2fc  and  f2(m,b,L,£)  ~  (bjc.£  +  Lbjc-2b)  Vb,  for  b^c.^  <  1  and 

b"fc-2b  I?  where  c  is  a  small  constant. 

The  following  lemma  shows  that  the  computation  performed  in  Algorithm  5  is  more  stable  than  general 
LU  decomposition.  Because  it  is  performed  on  an  orthonormal  matrix,  the  growth  factor  in  the  upper 
triangular  factor  is  bounded  by  2,  and  we  can  bound  the  Frobenius  norms  of  both  computed  factors  by 
functions  of  the  number  of  columns.  Further,  we  can  stably  compute  the  T  factor  (needed  to  apply  a 
blocked  Householder  update)  from  the  upper  triangular  LU  factor,  a  computationally  cheaper  method 
than  using  the  Householder  vectors  themselves. 

Lemma  5.4.  In  floating  point  arithmetic,  given  an  orthonormal  m  x  b  matrix  Q,  Algorithm  5  computes 
factors  S,  Y ,  and  T  such  that 


QS- 


-  YTY ; 


<  h(b,  e). 


where 


h(b,  e )  =  2%  (b2  +  (l  +  V2j  bj 

and  Y\  is  given  by  the  first  b  rows  ofY. 

Proof.  This  result  follows  from  the  standard  error  analysis  of  LU  decomposition  and  triangular  solve  with 
multiple  right  hand  sides,  and  the  properties  of  the  computed  lower  and  upper  triangular  factors. 

\S~\ 

First,  we  use  the  fact  that  for  LU  decomposition  of  Q  — 


(see  [25,  Section  9.3]) 


0 


( 

's' 

\  -  - 

(q- 

0 

J  -  YU 

into  triangular  factors  Y  and  U ,  we  have 
<lb\\Y\\F\\U\\F. 


Second,  we  compute  T  =  ( US)Y1  T  using  the  standard  algorithm  for  triangular  solve  with  multiple  right 
hand  sides  to  obtain  (see  [25,  Section  8.1]) 


U  -  TY{  S 


US  -  TY, 


<76||T|M|LiT||F. 
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Finally,  we  bound  the  norms  of  the  computed  triangular  factors  in  exact  arithmetic.  At  each  step 
of  the  Modified-LU  algorithm,  the  trailing  matrix  is  an  orthogonal  matrix  (before  modifying  the  diagonal 
element).  Each  column  of  Y  is  computed  by  scaling  down  elements  of  a  unit  vector  and  setting  the  diagonal 
element  to  1.  Thus,  the  sum  of  squares  of  elements  in  a  column  is  bounded  by  2,  and  ||T||F  <  y/2b.  Note 
that  this  also  implies  ||  Yj  \\p  <  \[2h.  Each  row  of  U  is  computed  by  increasing  the  absolute  value  of  the 
diagonal  element  by  1,  but  leaving  the  other  elements  to  the  right  of  the  diagonal  unchanged.  The  sum 
of  squares  of  elements  in  a  row  of  a  matrix  with  orthonormal  columns  is  at  most  1.  Thus,  the  sum  of 
squares  of  elements  in  a  row  of  U  is  bounded  by  4,  and  ||[/||F  <  2 \/b.  From  [22,  Theorem  13],  we  have 

||f||F  <6+1. 

Therefore,  altogether  we  have 


^Tfell^llFlI^llF  +  Tbll^llFlirilFllifllF 
<  76  (\/8b  +  26(6  +  1)^  . 


□ 

Theorem  5.5.  Let  R  be  the  computed  upper  triangular  factor  of  m  x  6  matrix  A  obtained  via  the  TSQR 
algorithm  using  a  binary  tree  of  height  L  (m/ 2L  >  b),  and  let  Q  =  I  —  YTY ±  and  R  =  SR  where  Y ,  T, 
and  S  are  the  computed  factors  obtained  from  the  Householder  reconstruction  algorithm.  Then 

\\A-  QR\\f  <  Fi(m,  b,L,e)\\A\\F 


and 

|| I  -  QtQ\\f  <  F2(m,  6,  L,  e) 

where 

Fi(m,  6,  L,  e)  ~  {b^c.xi_  +  Lb^c.2b){l  +  Vb)  +  2jb  (b2  +  (l  +  y/2j  b^j 

and 

F2(m ,  6,  L,  e)  ~  2(6yc.^  +  Lb~fc.2b)Vb  +  4yb  ^62  +  ^1  +  y/2j  b'j 

for  b^c.2R_  <<  1  and  b^yc.2b  <<  1. 

2l 

Proof.  From  Lemmas  5.3  and  5.4,  there  exists  an  exactly  orthonormal  matrix  Q  such  that 

\\A  -  QR\\f  =  || A  -QR-(Q-  Q)R  -  (Q  -  QS)R\\F 

<  \\A  -  QR\\f  +  || Q  -  Q\\f\\R\\f  +  || Q~  Q5||F||i2||F 

<  fi(m,b,  L,£)\\A\\f  +  f2(m,b,  L,e)\\R\\F  +  f3(b,  £)\\R\\F 

<  Fi(m,b,L,£)\\A\\F. 

The  approximation  of  F\  also  uses  [24,  Lemma  1]:  ||i?||F  ~  ||A||F  assuming  b^c.nL  <C  1  and  b^c.  26  -C  1. 
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Further,  since  Q  is  exactly  orthonormal,  we  have 


I  -  qTq\\f  =  || QT(Q  -  QS )  +  (Q  -  QS)tQ\\f 

<  \\Q  -  QS\\f  (i  +  ||Q||f) 

<  (\\Q-Q\\f  +  \\Q-QS\\f)  (i  +  IIQUf) 

<  (f2(m,b,L,e)  +  f3(b,e))  (l  +  ||QI|f) 

<  F2{m,b,L,£). 

□ 


6  Numerical  experiments 

In  this  section  we  present  numerical  results  of  TSQR-HR.  Experiments  were  conducted  on  two  represen¬ 
tative  sets  of  test  matrices.  The  first  set  is  used  to  check  the  stability  of  the  algorithm  on  single  panels 
represented  by  tall  and  skinny  matrices,  while  the  second  set  focuses  on  the  factorization  of  full  matrices 
panel  by  panel. 

6.1  Tall  and  Skinny  Matrices 

In  this  section,  we  use  matrices  which  are  formed  by  A  =  Q  ■  Rp ,  where  Q  and  R  are  computed  via  QR 
decomposition  of  an  m  x  b  matrix  with  i.i.d.  entries  chosen  from  a  normal  distribution.  Rp  is  obtained  by 
setting  the  |_§ J-th  diagonal  element  of  an  upper  triangular  matrix  R  to  a  small  parameter  value  p.  This 
experimental  setup  is  used  to  vary  the  condition  number  of  the  matrix  and  demonstrate  instability  of  the 
cheaper  method  described  in  Section  4.2. 

As  can  be  seen  from  Table  3,  for  all  test  cases  both  the  orthogonality  and  factorization  errors  of  are 
of  order  10-15  for  TSQR-HR,  which  is  close  to  e  =  2-52  of  double  precision.  This  demonstrates  the 
numerical  stability  of  this  approach  on  tall  and  skinny  matrices.  Table  3  shows  a  comparison  between 
three  approaches:  TSQR-HR,  Yamamoto’s  approach  described  in  Section  3.3,  and  LU (A  —  R)  described 
in  Section  4.2.  While  Yamamoto’s  approach  is  as  stable  as  TSQR-HR,  the  LU(A  —  R)  method  provides 
unsatisfactory  backward  errors  which  grow  with  the  condition  number  of  the  matrix. 

6.2  Square  Matrices 

We  now  present  numerical  results  for  the  QR  factorization  of  square  matrices  using  a  panel-by-panel 
factorization.  The  matrices  are  generated  similarly  to  [26],  where  the  set  is  chosen  from  well-known 
anomalous  matrices.  Most  are  of  size  1000-by-1000  (except  the  ARC130  and  FS_541_1  matrices,  which  are 
respectively  of  order  130  and  541),  with  various  condition  numbers,  some  of  them  being  very  ill-conditioned 
(i.e.  having  a  condition  number  much  larger  than  the  inverse  of  machine  precision).  We  tried  several  panel 
sizes  ranging  from  2  to  256  columns  and  report  only  the  largest  errors  and  their  associated  panel  widths. 

Results  from  Table  5  show  that  TSQR-HR  is  numerically  stable  in  terms  of  backward  errors  when 
computing  the  QR  factorization  of  full  matrices,  regardless  of  the  condition  number  of  the  matrix,  as 
suggested  by  Theorem  5.5.  Again,  Yamamoto’s  approach  displays  similar  results.  To  the  contrary,  the 
alternative  approach  of  LU(A  —  R)  does  not  yield  numerically  stable  results  in  practice,  confirming  what 
was  observed  on  tall  and  skinny  matrices.  Altogether,  these  two  sets  of  experiments  demonstrate  the 
numerical  stability  of  the  proposed  approach  on  representative  matrices. 
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Q  —  /  (T  from  Algorithm  5) 
norm-wise  col-wise  ortho. 

Yamamoto’s  approach 
norm-wise  col-wise  ortho. 

A  —  R  (T-i  from  Y'Y) 
norm-wise  col-wise  ortho. 

p 

K 

le-01 

5.1e+02 

2.2e-15 

2.7e-15 

9.3e-15 

2.5e-15 

3. le-15 

9.2e-15 

3.8e-14 

1.7e-14 

5.5e-15 

le-02 

5.0e+03 

2.3e-15 

2.9e-15 

1.0e-14 

2.4e-15 

3. le-15 

1.  le-14 

3.2e-13 

1. le-13 

6.2e-15 

le-03 

5.0e+04 

2.2e-15 

2.6e-15 

8.4e-15 

2.6e-15 

3.4e-15 

1.  le-14 

4.2e-12 

1.7e-12 

5.6e-15 

le-04 

4.9e+05 

2.2e-15 

2.6e-15 

7.7e-15 

2.3e-15 

2.8e-15 

8.7e-15 

3.8e-ll 

1.7e-ll 

5.4e-15 

le-05 

5.1e+06 

2.3e-15 

2.9e-15 

8.7e-15 

3.2e-15 

4.2e-15 

1.0e-14 

3.9e-10 

1.4e-10 

5.3e-15 

le-06 

5.0e+07 

2.3e-15 

3.0e-15 

9. le-15 

3.0e-15 

3.9e-15 

1.0e-14 

3.6e-09 

1.5e-09 

6. le-15 

le-07 

5.0e+08 

2.4e-15 

3.4e-15 

1. le-14 

2.7e-15 

3.7e-15 

9.9e-15 

4.2e-08 

2.  le-08 

5.0e-15 

le-08 

5.1e+09 

2.2e-15 

2.8e-15 

8.6e-15 

2.5e-15 

3. le-15 

8.9e-15 

3.8e-07 

1.5e-07 

5.8e-15 

le-09 

5.0e+10 

2.3e-15 

3. le-15 

9.9e-15 

3.9e-15 

5. le-15 

1.3e-14 

3.6e-06 

2.0e-06 

5.4e-15 

le-10 

5.0e+H 

2. le-15 

2.6e-15 

7. le-15 

2.6e-15 

3.4e-15 

9.9e-15 

3.3e-05 

1.2e-05 

6.3e-15 

le-11 

4.9e+12 

2.5e-15 

3.4e-15 

1.0e-14 

2.4e-15 

3. le-15 

1.0e-14 

3.  le-04 

1.2e-04 

5.9e-15 

le-12 

5.1e+13 

2.2e-15 

2.9e-15 

8.5e-15 

2.6e-15 

3.3e-15 

1.2e-14 

3.7e-03 

1.6e-03 

5.8e-15 

le-13 

5.0e+14 

2.2e-15 

2.7e-15 

8.8e-15 

3.0e-15 

3.9e-15 

1.0e-14 

4.0e-02 

1.4e-02 

4.7e-15 

le-14 

3.5e+15 

2.3e-15 

3. le-15 

1.0e-14 

2.3e-15 

2.9e-15 

9.4e-15 

2.7e-01 

9.7e-02 

4.9e-15 

le-15 

5.0e+15 

2.4e-15 

3. le-15 

9.7e-15 

2.8e-15 

3.7e-15 

9.4e-15 

3.5e-01 

1.3e-01 

6.3e-15 

Table  3:  Error  on  tall  and  skinny  matrices  (m  =  1000,  n  =  200)  for  three  approaches.  The  label  “norm- 
wise”  corresponds  to  \\A  —  QR\\-2,  “col-wise”  corresponds  to  maxj  \\Ai  —  (QR)i\\ 2,  and  “ortho.”  corresponds 
to  ||/-QtQ||2. 


Id 

Matrix  type 

Size 

hi 

1. 

A  =  2  *  rand(m)  —  1 

1000-by-1000 

2.106e  +  03 

2. 

A  =  diag((  10  *  eps)™)  *  rand(m) 

1000-by-1000 

7.731e+  17 

3. 

Golub-Klema-Stewart:  A(i,i)  =  -jq,A(i,j  >  i)  =  ~^,A(i,j  <  i)  =  0 

1000-by-1000 

2.242e  +  20 

4. 

Break  1  distribution:  A  =  gallery  ('randsvd' ,  n,  le9,  2) 

1000-by-1000 

1.00e  +  09 

5. 

Break  9  distribution:  A  =  gallery  ('randsvd'  ,n,  le9, 1) 

1000-by-1000 

1.00e  +  09 

6. 

A  =  orth(rand(n))  ■  dia<7([100, 10,  linspace( le  —  8,  le  —  2,  n  —  2)]) 

A  =  A  ■  orth(rand(n )) 

1000-by-1000 

1.00e  +  10 

7. 

v  =  linspace(  1,  le  —  3,  n);  t>(51  :  end)  =  0; 

A  =  orth(rand(n))  ■  diag(v)  •  orth(rand(n))  +  0.1  *  t>(50)  *  rand(n) 

1000-by-1000 

8.464e  -(-  04 

8. 

UYV1  with  exponential  distribution 

1000-by-1000 

4.155e+  19 

9. 

The  devil’s  stairs  matrix 

1000-by-1000 

2.275e+  19 

10. 

KAHAN  matrix,  a  trapezoidal  matrix 

1000-by-1000 

5.642e  +  56 

11. 

Matrix  ARC130  from  Matrix  Market 

130-by-130 

6.054e+  10 

12. 

Matrix  FS_541_1  from  Matrix  Market 

541-by-541 

4.468e  +  03 

13. 

BAART  Test  problem:  Fredholm  integral  equation  of  the  first  kind 

1000-by-1000 

5.251e+  18 

14. 

BLUR  Test  problem:  digital  image  deblurring. 

A  is  a  symmetric,  doubly  block  Toeplitz  matrix 

961-by-961 

3.075e  +  01 

15. 

DERIV2  Test  problem:  computation  of  the  second  derivative 

1000-by-1000 

1.216e  +  06 

16. 

Matrix  Exl-CMRS 

1000-by-500 

1.398e+  17 

17. 

Matrix  Ex2-RST 

1024-by-512 

3.276e+  16 

18. 

FOXGOOD  Test  problem:  severely  ill-posed  problem 

1000-by-1000 

5.742e  +  20 

19. 

GRAVITY  Test  problem:  1-D  gravity  surveying  model  problem 

1000-by-1000 

8.797e  +  20 

20. 

HEAT  Test  problem:  inverse  heat  equation 

1000-by-1000 

1.070e  +  232 

21. 

PARALLAX  Stellar  parallax  problem  with  28  fixed,  real  observations 

26-by-1000 

4.629e+  14 

22. 

PHILLIPS  Test  problem:  discretization  of  the  ‘famous’  first-kind 
Fredholm  integral  equation  deviced  by  D.  L.  Phillips 

1000-by-1000 

2.641e  +  10 

23. 

SHAW  Test  problem:  one-dimensional  image  restoration  model 

1000-by-1000 

2.441e  +  21 

24. 

SPIKES  Test  problem  with  a  ’’spiky”  solution 

1000-by-1000 

5.796e  +  21 

25. 

TOMO  is  a  2D  tomography  test  problem 

961-by-961 

1.091e+  17 

Table  4:  Description  of  the  full  matrices  used  in  the  experiments. 
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Id 

Q  —  I  (T  from  Algorithm  5) 
norm-wise  error  column-wise  error  orthogonality 

Yamamoto’s  approach 
norm-wise  error  column-wise  error 

orthogonality 

A  - 

norm-wise  error 

R(T-L  from  Y 1  Y ) 
column-wise  error 

orthogonality 

1 

4.3e-15  (256) 

3.7e-15  (256) 

2.8e-14  (2) 

4.7e-15  (256) 

4.4e-15  (256) 

2.8e-14  (2) 

3.5e-15  (256) 

3.5e-15  (256) 

2.8e-14  (2) 

2 

2.0e-15  (256) 

6.6e-15  (64) 

2.6e-14  (2) 

2.6e-15  (256) 

7.4e-15  (256) 

2.6e-14  (2) 

2.7e-15  (256) 

9.0e-15  (256) 

3.1e-14  (2) 

3 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

4 

1.0e-14  (256) 

5.5e-15  (256) 

2.8e-14  (2) 

l.le-14  (256) 

6.3e-15  (256) 

2.8e-14  (2) 

4.4e-14  (256) 

4.4e-14  (256) 

2.8e-14  (2) 

5 

9.9e-15  (256) 

5.1e-15  (256) 

2.9e-14  (2) 

1.0e-14  (256) 

6.2e-15  (256) 

2.8e-14  (2) 

4.2e-14  (256) 

3.0e-14  (256) 

2.8e-14  (2) 

6 

1.4e-15  (256) 

5.9e-15  (256) 

2.8e-14  (2) 

2.2e-15  (128) 

6.9e-15  (256) 

2.8e-14  (2) 

1.4e-15  (256) 

1.8e-14  (256) 

2.8e-14  (2) 

7 

1.4e-15  (256) 

4.6e-15  (64) 

2.7e-14  (2) 

3.5e-15  (256) 

7.7e-15  (256) 

2.8e-14  (2) 

2.3e-15  (256) 

•5.8e-15  (256) 

2.8e-14  (2) 

8 

2.0e-15  (256) 

4.3e-15  (64) 

2.8e-14  (2) 

3.8e-15  (256) 

7.0e-15  (256) 

2.8e-14  (2) 

1.9e-15  (256) 

1.4e-14  (256) 

2.8e-14  (2) 

9 

2.4e-15  (256) 

5.0e-15  (256) 

2.8e-14  (2) 

2.7e-15  (256) 

5.4e-15  (256) 

2.8e-14  (2) 

2.9e-15  (256) 

1.5e-14  (256) 

2.8e-14  (2) 

10 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

0.0e+00  (2) 

11 

8.8e-19  (16) 

1.3e-15  (16) 

2.1e-15  (2) 

l.le-18  (16) 

1.8e-15  (16) 

3.9e-15  (16) 

3.9e-19  (32) 

8.7e-16  (32) 

1.7e-15  (2) 

12 

5.8e-16  (64) 

1.3e-15  (2) 

1.8e-15  (256) 

8.4e-16  (16) 

1.9e-15  (32) 

2.8e-15  (64) 

6.0e-16  (256) 

1.3e-15  (32) 

1.7e-15  (2) 

13 

1.6e-15  (32) 

6.2e-15  (256) 

2.9e-14  (2) 

2.0e-15  (32) 

7.3e-15  (256) 

2.8e-14  (2) 

3.0e-07  (256) 

1.5e-06  (256) 

2.8e-14  (2) 

14 

1.0e-15  (32) 

1.5e-15  (16) 

4.7e-15  (2) 

1.6e-15  (256) 

2.8e-15  (256) 

6.8e-15  (128) 

8.2e-16  (16) 

1.5e-15  (16) 

3.9e-15  (2) 

15 

2.8e-15  (256) 

8.7e-15  (256) 

4.6e-14  (2) 

1.0e-14  (256) 

1.6e-14  (256) 

4.8e-14  (2) 

1.0e-12  (256) 

5.3e-12  (256) 

5.6e-14  (2) 

16 

1.8e-15  (256) 

5.7e-15  (256) 

4.9e-14  (2) 

1.7e-15  (256) 

5.7e-15  (256) 

4.0e-14  (2) 

4.3e-07  (256) 

2.3e-06  (256) 

3.3e-14  (2) 

17 

3.7e-15  (32) 

l.le-14  (16) 

5.0e-14  (2) 

3.0e-15  (16) 

9.8e-15  (32) 

3.8e-14  (2) 

1.4e-14  (256) 

2.3e-14  (256) 

4.5e-14  (2) 

18 

2.4e-15  (256) 

6.4e-15  (16) 

2.8e-14  (2) 

3.5e-15  (256) 

8.0e-15  (256) 

2.8e-14  (2) 

8.2e-04  (256) 

5.7e-03  (256) 

2.8e-14  (2) 

19 

2.3e-15  (256) 

4.8e-15  (64) 

2.9e-14  (2) 

3.1e-15  (256) 

7.1e-15  (256) 

2.8e-14  (2) 

2.1e-03  (256) 

l.le-02  (256) 

2.8e-14  (2) 

20 

2.6e-15  (256) 

4.5e-15  (256) 

3.9e-14  (2) 

2.6e-15  (256) 

5.1e-15  (256) 

4.0e-14  (2) 

3.8e-02  (256) 

1.7e-01  (256) 

3.4e-14  (2) 
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8.8e-16  (16) 

1.9e-15  (32) 

2.6e-15  (16) 

l.le-15  (32) 

2.3e-15  (32) 

2.5e-15  (32) 

l.le-11  (32) 

1.8e-10  (32) 

2.3e-15  (16) 

22 

1.0e-15  (32) 

4.9e-15  (32) 

3.7e-14  (2) 

1.2e-15  (256) 

5.0e-15  (256) 

3.1e-14  (2) 

5.2e-ll  (256) 

1.9e-10  (256) 

3.4e-14  (2) 
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2.5e-15  (256) 

5.3e-15  (64) 

2.8e-14  (2) 

2.8e-15  (256) 

7.1e-15  (256) 

2.9e-14  (2) 

7.0e-04  (256) 

5.2e-03  (256) 

2.8e-14  (2) 

24 

7.2e-16  (32) 

3.8e-15  (256) 

2.8e-14  (2) 

1.2e-15  (256) 

4.7e-15  (256) 

2.8e-14  (2) 

2.2e-03  (256) 

2.6e-02  (256) 

2.8e-14  (2) 

25 

2.2e-15  (256) 

4.3e-15  (256) 

2.2e-14  (2) 

2.3e-15  (256) 

•5.0e-15  (256) 

2.2e-14  (2) 

2.3e-15  (256) 

1.2e-14  (256) 

2.3e-14  (2) 

Table  5:  Error  of  full  matrices  (n  =  1000).  The  label  “norm-wise”  corresponds  to  ||A  —  QR ((2,  “col-wise” 
corresponds  to  max*  ||  A;  —  (QR)i || 2 ,  and  “ortho.”  corresponds  to  \\I  —  QTQ\\2-  Descriptions  of  the  matrices 
can  be  found  in  Table  4. 


7  Performance 

Having  established  the  stability  of  our  algorithm,  we  now  analyze  its  experimental  performance.  We 
demonstrate  that  for  tall  and  skinny  matrices  TSQR-HR  achieves  better  parallel  scalability  than  library 
implementations  (ScaLAPACK  and  Elemental)  of  Householder-QR.  Further,  we  show  that  for  square 
matrices  CAQR-HR-Agg  outperforms  our  implementation  of  CAQR,  and  library  implementations  of  2D- 
Householder-  QR . 

7.1  Architecture 

The  experimental  platform  is  “Hopper,”  which  is  a  Cray  XE6  supercomputer,  built  from  dual-socket  12-core 
“Magny-Cours”  Opteron  compute  nodes.  We  used  the  Cray  LibSci  BLAS  routines.  This  machine  is  located 
at  the  NERSC  supercomputing  facility.  Each  node  can  be  viewed  as  a  four-chip  compute  configuration  due 
to  NUMA  domains.  Each  of  these  four  chips  have  six  super-scalar,  out-of-order  cores  running  at  2.1  GHz 
with  private  64  KB  LI  and  512  KB  L2  caches.  Nodes  are  connected  through  Cray’s  “Gemini”  network, 
which  has  a  3D  torus  topology.  Each  Gemini  chip,  which  is  shared  by  two  Hopper  nodes,  is  capable  of 
9.8  GB/s  bandwidth. 

7.2  Parallel  Scalability 

In  this  section,  we  give  performance  results  based  on  a  C+- I-/MPI/LAPACK  implementation  of  TSQR 
and  TSQR-HR,  as  well  as  two  library  implementations  of  2D-Householder-QR,  Elemental  (version  0.80) 
and  ScaLAPACK  (native  LibSci  installation  on  Hopper,  October  2013).  Our  implementations  aim  to  do 
minimal  communication  and  arithmetic,  and  do  not  employ  low-level  tuning  or  overlap  between  commu¬ 
nication  and  computation.  All  the  benchmarks  use  one  MPI  process  per  core,  despite  the  fact  that  is 
favorable  on  Hopper  to  use  one  process  per  socket  and  six  threads  per  process.  This  decision  was  made 
because  we  observed  that  some  of  the  many  LAPACK  routines  used  throughout  our  codes  (geqrf ,  ormqr, 
tpqrt,  tmpqrt,  etc.)  were  not  threaded. 
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QR  weak  scaling  on  Hopper  (15K-by-15K  to  131  K-by-1 31 K) 


#cores 


(a)  Tall-skinny  QR  performance  on  Cray  XE6 


QR  weak  scaling  on  Hopper  (15K-by-15K  to  131  K-by-1 31 K) 


First,  we  study  the  performance  of  QR  factorization  of  tall-skinny  matrices  using  a  ID  processor  grid. 
Figure  1(a)  gives  the  strong  scaling  performance  for  a  matrix  of  size  122,880-by-32.  We  also  tested  a  range 
of  reasonable  panel  sizes  that  are  not  detailed  here  and  observed  similar  performance  trends.  We  observe 
from  Figure  1(a)  that  TSQR-HR  takes  roughly  twice  the  execution  time  of  TSQR,  which  is  in  line  with 
our  theoretical  cost  analysis.  Figure  1(a)  also  gives  the  time  to  solution  of  Elemental  and  ScaLAPACK, 
which  both  use  the  Householder- QR  algorithm,  albeit  with  different  matrix  blocking  and  collectives.  We 
see  that  TSQR  obtains  a  performance  benefit  over  Householder-QR  due  to  the  lower  synchronization  cost 
and  TSQR-HR  preserves  the  scaling  behavior  with  roughly  a  factor  of  two  overhead. 

Second,  we  study  the  parallel  scaling  of  QR  factorization  on  square  matrices.  In  Figure  1(b),  we  compare 
our  implementation  of  CAQR  (with  a  simple  binary  tree  update,  no  pipelining  or  other  optimizations), 
CAQR-HR,  and  CAQR-HR-Agg,  with  Elemental  and  ScaLAPACK,  which  use  2D-Householder-QR.  We 
tuned  the  block  sizes  of  all  the  codes  (the  CAQR-HR-Agg  required  tuning  two  block  sizes),  though  fewer 
data  points  were  collected  for  larger  scale  runs,  due  to  timing  and  allocation  constraints. 

Comparing  the  performance  of  CAQR-HR-Agg  and  CAQR-HR  in  Figure  1(b),  we  observe  that  signifi¬ 
cant  benefit  is  obtained  from  aggregating  the  trailing  matrix  update.  By  combining  the  aggregated  update 
and  lower  synchronization  cost  of  TSQR,  CAQR-HR-Agg  outperforms  ScaLAPACK  and  Elemental  and 
achieves  better  parallel  scalability.  On  the  other  hand,  the  CAQR  performance  is  relatively  poor  due  to 
the  overhead  of  the  implicit  tree  trailing  update.  We  also  note  that  for  Elemental,  ScaLAPACK,  and  all  of 
our  QR  implementations,  it  was  often  better  to  utilize  a  rectangular  processor  grid  with  more  rows  than 
columns.  Having  more  processes  in  each  column  of  the  processor  grid  accelerates  the  computation  of  each 
tall-skinny  panel. 

8  Conclusion 

In  this  paper,  we  introduce  a  method  for  recovering  the  Householder  basis-kernel  representation  from  any 
matrix  with  orthonormal  columns  in  a  stable  and  efficient  manner.  Our  method  was  motivated  by  the 
desire  to  combine  the  efficiency  of  the  TSQR  algorithm  with  that  of  the  trailing  matrix  updates  within 
Householder-QR  in  the  context  of  computing  the  QR  factorization  of  general  matrices. 

We  argue  using  a  performance  cost  model  that  the  savings  from  combining  the  approaches  outweigh 
the  extra  cost  required  to  reconstruct  the  Householder  vectors  for  each  panel  factorization,  observing 
asymptotic  improvements  over  both  existing  approaches.  We  also  demonstrate  that  our  algorithm  is 
practical,  attaining  speed-ups  of  up  to  1.5x  for  CAQR-HR-Agg  over  the  standard  2D-Householder-QR 
algorithm,  and  we  conjecture  that  larger  speed-ups  may  be  achieved  on  future  architectures,  which  are 
becoming  increasingly  more  synchronization  and  communication  bound. 

Our  approach  provides  a  promising  direction  for  heterogenous  architectures  (as  suggested  in  [11]),  where 
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synchronization-avoidance  and  high  granularity  computation  have  even  more  pervasive  effects  on  perfor¬ 
mance  efficiency.  Furthermore,  because  our  approach  recovers  the  standard  representation  of  orthogonal 
matrices  (as  is  used  in  libraries  like  LAPACK),  we  are  able  to  re-use  the  existing  software  infrastructure 
and  maintain  performance  portability. 

Finally,  we  conjecture  that  the  Householder  reconstruction  technique  will  enable  the  design  of  a  QR 
algorithm  which  is  as  stable  as  Householder  QR  and  reduces  the  bandwidth  cost  asymptotically  compared 
to  parallel  CAQR.  We  aim  to  reduce  the  asymptotic  bandwidth  cost  for  QR  as  done  by  Tiskin  [27],  except 
in  a  more  practical  manner,  following  the  communication-optimal  parallel  algorithm  for  LU  [28]. 
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A  Improved  CAQR 

In  this  Appendix,  we  show  how  the  CAQR  algorithm  can  be  improved  via  a  more  efficient  trailing  matrix 
update  algorithm.  The  main  idea  of  this  approach  is  to  use  a  recursive  halving  and  recursive  doubling 
approach  over  the  columns  of  the  trailing  matrix  while  applying  the  TSQR  tree.  This  task  can  be  achieved 
via  a  butterfly  communication  network  which  performs  the  update  and  scatters  the  data,  followed  by 
an  inverted  butterfly  network  which  collects  the  scattered  data  via  recursive  doubling.  Performing  the 
trailing  matrix  update  in  this  manner  also  requires  replicating  the  TSQR  tree  as  a  butterfly  (Householder 
data  replicated  2k  times  at  the  fcth  level).  The  Householder  tree  data  may  be  replicated  explicitly  via 
communication  or  can  be  computed  redundantly  during  the  factorization  phase  via  a  butterfly  network 
TSQR  (see  Algorithm  9)  for  no  extra  parallel  cost.  We  arrived  at  this  algorithm  only  after  implementing 
and  evaluating  CAQR  with  a  simple  binary  tree  network  for  the  TSQR  and  the  trailing  matrix  update. 
We  plan  to  implement  and  study  the  performance  of  this  trailing  matrix  update  in  the  context  of  CAQR, 
as  it  alleviates  important  algorithmic  bottlenecks  on  which  we  elaborate  below. 

Algorithm  9  demonstrates  how  TSQR  can  be  done  via  butterfly  tree  network.  The  main  difference 
between  the  butterfly  algorithm  and  the  binary  tree  TSQR  is  the  fact  that  the  butterfly  algorithm  com¬ 
putes  and  stores  each  Yt  ^  redundantly  on  2k  processors  on  level  k.  Algorithm  10  uses  the  redundantly 
stored  representation  produced  by  Algorithm  9  to  perform  the  trailing  matrix  update  more  efficiently.  In 
particular,  at  each  level  of  the  first  butterfly  network  in  Algorithm  10,  the  working  part  of  the  columns  of 
the  trailing  matrix  are  partitioned  in  half  and  scattered  amongst  pairs  of  processor  rows.  In  the  binary 
tree  application  method  (Algorithm  3),  half  the  processors  were  assigned  work  at  each  successive  level 
of  the  tree.  In  our  butterfly  algorithm,  all  the  processors  are  assigned  half  the  work  at  each  successive 
level  of  the  butterfly.  As  a  result,  the  number  of  columns  each  processor  works  on  reduces  in  half  at  each 
successive  level  in  the  butterfly.  We  note  that  Algorithm  10  is  different  from  the  butterfly  algorithm  for 
the  trailing  matrix  application  presented  in  [19],  which  applies  the  redundant  copies  of  to  the  columns 
of  the  trailing  matrix  redundantly. 

The  benefit  achieved  by  Algorithm  10  over  Algorithm  3  is  reflected  both  in  communication  bandwidth 
and  computational  cost.  The  number  of  words  moved  by  each  processor  in  Algorithm  10  is  given  to  leading 
order  by 

logp 

2  ^  nb/2l  =  2 nb. 
i= 1 

We  note  that  this  is  a  factor  of  logp  smaller  than  the  bandwidth  cost  associated  with  Algorithm  3.  Further, 
excluding  the  cost  of  the  initial  application  on  line  1  (which  also  occurs  in  Algorithm  3),  the  computational 
cost  of  the  algorithm  is  given  by 

logp 

^0(nb2/2i)  =  0(nb2), 
i=l 

which  compares  favorably  to  Algorithm  3  which  has  a  cost  of  0{nb 2  logp).  The  reduction  in  bandwidth  cost 
yielded  by  this  algorithm  leads  directly  to  a  reduction  in  bandwidth  cost  from  0(mra+^1°s'p)  to  0(mnJP  ) 
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(where  we  assume  pr  =  pc  =  \/p)-  Further,  the  reduction  in  floating  point  cost  makes  it  possible  to  select 
a  block  size  for  CAQR  which  is  a  factor  of  logp  larger  without  incurring  any  overhead  in  leading  order 
computational  cost.  Previously,  the  block  size  was  made  smaller  due  to  the  0{n2b  log p)  computational 
term  associated  with  using  Algorithm  3  for  the  trailing  matrix  update.  Raising  the  block  size  by  a  factor  of 
logp  yields  a  reduction  in  the  latency  cost  of  CAQR  by  the  same  factor.  The  overall  cost  of  the  improved 
CAQR  algorithm  (using  Algorithms  9  and  10)  is  given  to  leading  order  for  nearly  square  matrices  by 

f  2mn2  —  2n3/3\  n  f 2mn  +  2n2\  ,  2  \\ 

7  •  +  fl  •  - — - J  +  a  •  (7^1og 2  p))  • 

Therefore,  it  is  possible  for  CAQR  to  achieve  the  same  asymptotic  costs  as  CAQR-HR;  however,  it  is  not 
possible  to  aggregate  the  update  to  obtain  the  practical  benefits  harvested  by  CAQR-HR-Agg.  Further, 
the  reconstruction  of  Householder  vectors  requires  less  software  engineering  to  be  incorporated  into  high 
performance  numerical  linear  algebra  libraries. 


Algorithm  9  [{Yjtk},R}  =  Butterfly-TSQR(A) 

Require:  Number  of  processors,  p.  is  a  power  of  two  and  i  is  the  processor  index 
Require:  A  is  m  x  b  matrix  distributed  in  block  row  layout;  A%  is  processor  V s  block 


1 

2 

3 

4 

5 

6 
7 

8: 


9 

10 

11 

12: 

13 

14 

15 


[Yifi,Ri]  =  Householder-QR(Aj) 
for  A;  =  1  to  logp  do 

%  Determine  my  neighbor  in  this  level  of  the  butterfly 
j  =  2k[^\  +  (i  +  2fc_1  mod  2k) 
if  i  <  j  then 

Send  Rj  from  processor  j 
Receive  Rj  from  processor  j 

[Y^kjRi]  =  Householder- QR 

else 

Receive  Rj  from  processor  j 
Send  Rj  to  processor  j 

[Yi^,Ri]  =  Householder- QR 

end  if 
end  for 

R  =  R{ 

Ensure:  A  =  QR  with  Q  implicitly  represented  by 

Ensure:  R  is  stored  redundantly  on  all  processors,  Yo,o  is  stored  by  processor  0,  and  for  i  >  0  is  stored 
redundantly  on  processor  j  for  each  j  +  2k~1  =  i  mod  2k 


Ri 
R  i 


Rj 

Ri 


B  Householder-QR  with  Hints 

The  idea  in  this  section  is  to  interpret  computing  the  LU  decomposition  of  A  —  R  (discussed  in  Section 
4.2)  as  performing  Householder-QR  on  A,  but  we  will  use  the  information  in  R  as  “hints”  to  avoid  parts 
of  the  computation  (particularly  the  communication-expensive  parts). 

Let  A^l  be  the  partially  factored  matrix  whose  first  i  columns  form  an  upper  triangular  matrix.  Since 
Y  is  a  lower  triangular  matrix  (the  height  of  the  nonzero  part  of  the  Householder  vectors  decreases  as 
the  algorithm  progresses),  the  ith  row  of  the  trailing  matrix  is  not  updated  after  the  ith  Householder 
reflection  is  applied.  This  implies  that  the  ?'t!l  row  of  A^  is  equal  to  the  ith  row  of  the  final  output;  that 
is,  #(i,:)  =  R(i,:)- 
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Algorithm  10  [. B ]  =  Scatter-Apply-TSQR-QT({Kj A) 

Require:  Number  of  processors,  p,  is  a  power  of  two  and  i  is  the  processor  index 

Require:  A  is  rn  x  n  matrix  distributed  in  block  row  layout;  A%  is  processor  i’s  block 

Require:  {L^}  is  the  implicit  representation  of  b  Householder  vectors  computed  via  a  butterfly  TSQR. 

1:  Bi  =  Apply-Householder-Q-^Y^o,  ^i) 

2:  Let  Bt  be  the  first  b  rows  of  B, 

3:  %  Butterfly  network  recursive-halving  TSQR 
4:  for  k  =  1  to  logp  do 

5:  %  Determine  my  neighbor  in  this  level  of  the  butterfly 

6:  j  =  2k[fE\  +  (i  +  2k~l  mod  2k ) 

7:  %  Subdivide  columns  via  recursive  halving 

8:  Let  Bj  =  [Bn,Bi2\  where  each  block  is  b-by-n/2k 

9:  if  i  <  j  then 

10:  %  Send  half  my  columns  to  my  neighbor  and  receive  half  of  his 

11:  Send  Bi2  to  processor  j 

12:  Receive  Bj\  from  processor  j 

13:  %  Apply  the  Householder  vectors  to  stacked  blocks  to  compute  input  to  next  level  of  butterfly 

14:  =  Apply-Householder-QT  ^  ^ 

15:  %  Lower  rows  are  not  acted  on  at  further  levels  in  butterfly,  so  can  be  returned  immediately 

16:  Send  Bk{  back  to  processor  j 

17:  else 

18:  Receive  Bj2  from  processor  j 

19:  Send  Bn  to  processor  j 

20:  =  Apply-Householder-QT  ( Yi  k,  ) 

Bi2\  _  \  ) 

21:  Receive  updated  rows  Bk{  back  from  processor  j 

22:  end  if 

23:  end  for 

24:  %  Propagate  computed  B  back  to  origins  via  recursive  doubling 
25:  for  k  =  logp  down  to  1  do 

26:  %  Determine  my  neighbor  in  this  level  of  the  butterfly 

27:  j  =  2k[^\  +  (i  +  2k~l  mod  2k ) 

28:  if  i  <  j  then 

29:  %  Collect  upper  rows 

30:  Receive  Bj  from  processor  j 

31:  Bi=[Bi,Bj\ 

32:  else 

33:  %  Collect  lower  rows 

34:  Send  Bi  to  processor  j 

35:  Bt  =  [BkvBk2\ 

36:  end  if 

37:  end  for 

38:  Set  the  first  b  rows  of  B{  to  Bi 

Ensure:  B  =  QT A  where  Q  is  the  orthogonal  matrix  implicitly  represented  by 
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Algorithm  11  [Y,  S]  =  Householder-QR-with-Hints(A,  R) 

Require:  A  is  m  x  b,  R  is  upper  triangular  such  that  R  =  QT A  for  some  orthogonal  Q 
1:  S  =  I 

2:  for  i  =  1  to  b  do 

%  Compute  the  Householder  vector  to  annihilate  sub-diagonal  entries  in  the  ith  column 
3:  a  =  A(i,  i) 

%  Compute  column  norm  from  diagonal  entry  of  R 
4:  (3  =  R[i ,  i) 

5:  if  sgn(/3)  =  sgn(a))  then 

6:  S(i,  i)  =  —1 

7:  13  = -13 

8:  end  if 

9:  r(i)  =  ^ 

10:  A(i  +  1  :  m,  i)  =  +  1  :  *) 

%  Apply  the  Householder  transformation  to  the  trailing  matrix 
%  Compute  z  from  the  ith  row  of  R 
11:  z  =  A(i,  i  +  1  :  n)  —  S(i,  i )  •  R(i,  i  +  1  :  n) 

12:  A(i  +  1  :  m,  i  +  1  :  n)  =  A(i  +  1  :  m,  i  +  1  :  n)  —  A(i  +  1  :  m,  i)  ■  z 

13:  end  for 

Ensure:  A  =  (EL'li^  ~  RVi'dl ))  SR 

Ensure:  Y  (the  Householder  vectors)  has  implicit  unit  diagonal  and  overwrites  the  strict  lower  triangle 
of  A ;  r  is  an  array  of  length  b  with  n  =  2/ ( yjyi ) 


In  order  to  compute  the  Householder  vector  ]ji,  we  must  compute  the  norm  of  :  m,i )  and  scale 

each  entry  below  the  diagonal  by  the  reciprocal  of  the  norm.  Computing  this  norm  requires  a  reduction, 
but  we  can  avoid  this  calculation  because  the  norm  is  given  by  the  absolute  value  of  the  diagonal  entry 
R(i,  i). 

Further,  after  the  Householder  vector  yi  is  computed,  the  orthogonal  reflector  is  applied  to  the  trailing 
matrix  in  the  form  of  a  rank-one  update.  This  update  is  given  by 

A«  =  (I-  rmyf)  A^-1)  =  A^  -  yt  fry? A^1))  =  A^  -  yiZi  (10) 

where  Zi  is  the  ith  row  of  Z,  yi  is  the  ith  column  of  Y,  and  r*  =  ^ — .  Note  that  the  ith  entry  of  yi,  or 
Y{i,i),  is  always  1.  Then  by  Equation  (10),  we  have 

R{i, :)  =  A(i)(i, :)  =  A(i_1)(*, :)  -  y^i)  ■  Zi  =  A^_1)(*, :)  -  Zi 

so  that  Zi  =  A^1) ( i , :)  —  R(i, :).  This  implies  that  Zi  can  be  computed  via  subtraction  rather  than  from  the 
formula  z-(  =  ryf  A^*-1),  thereby  avoiding  the  matrix-vector  product.  Algorithm  11  shows  the  Householder- 
QR-with-Hints  algorithm  which  incorporates  the  cheaper  means  of  computing  column  norms  and  vectors 
zt.  Close  comparison  of  Algorithm  11  with  Algorithm  1  will  show  the  computational  savings  from  lines  4 
and  11,  where  hints  from  R  are  used  to  compute  the  relevant  quantities  more  cheaply. 
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